In the early 2000s, the New York City Police Department was required to publicly report data on their stop-and-frisk program. This program allows officers to temporarily stop, question, and potentially frisk/search civilians they suspect may have committed a crime without needing evidence or a warrant. The stops may result in release, arrest, or criminal summons. As civilians can be stopped without evidence, we want to explore what variables are the most accurate prediction of the stop resulting in arrest. In order to predict arrests, we will use classification methods. WRITE MORE ABOUT METHODS, FINDINGS, and CONCLUSIONS
The goal of our project is to predict Y. edit, include key findings
For over 20 years, civilians in NYC may be stopped, question, and searched by the police based solely on suspicion without any evidence. From 2002 – 2012, under Mayor Bloomberg’s administration, stops were more rampant, reaching a peak in 2011 with 685,724 reported stops (NYCLU 2024). With changes to administrations, the number of stops has decreased in recent years with 15,102 and 16,791 stops in 2022 and 2023 respectively. The stop-and-frisk program has been highly criticized as it does not require evidence to stop individuals and may allow police to use racial profiling tactics to target individuals. For example, the American Civil Liberties Union of New York (NYCLU) has criticized the program as their investigation of the data found a disproportionate number of civilians stopped are people of color, in particular Black civilians, and some stops have resulted in police misconduct. As the stops do not require previous evidence or warrants, we are interested in predicting which stops result in arrest, particularly focusing on factors gathered before the stop begins that may predict arrest, such as demographic characteristics of civilians, location, time of day, and attributes of the police officers.
Target variable: Our goal is to predict
arrested_flag, which is a binary indicator which equals 1
if the individual who was stopped was arrested, and 0 otherwise.1
Motivation: edit here re relevance, cite some literature
Relevant features: edit
Methods: edit
We use the New York Police Department Stop, Question and Frisk Data from 2023.
This is a stop-level dataset; each observation (row) corresponds to a unique stop made by an NYPD police officer in 2023 as part of the SQF programme.
How data is collected & what universe is/is not observed (some note here on the forms they have to fill out to record the stop, refer Precinct and Prejudice paper on why some stops may not actually be recorded)
Temporal span: the entire year of 2023.
Spatial span: the 77 NYPD police precincts covering all 5 boroughs of NYC.
Dimension: The data has 16971 observations and 82 features, as shown below.
We begin the project by installing and loading in the necessary libraries.
# Install and load required packages
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
library(pacman)
p_load(readxl, dplyr, ggplot2, knitr, lubridate, tidyr, sf, httr, caret, glmnet, stringr, remotes, RColorBrewer, viridis, scales, classInt, forcats, pROC, randomForest)
We proceed by setting up the file path and importing the data from our project’s GitHub repository.
# get raw content of the file
response <- GET("https://raw.githubusercontent.com/rrachelkane/data-science-group-project/main/data/sqf-2023.xlsx")
# retrieve the .xlsx file
if (status_code(response) == 200) {
# create a temporary file to save the downloaded content
temp_file <- tempfile(fileext = ".xlsx")
# Write the raw content to the temporary file
writeBin(content(response, "raw"), temp_file)
# Read the Excel file from the temporary file
sqf_data <- read_xlsx(temp_file)
# View the first few rows of the data
head(sqf_data)
} else {
stop("Failed to download the file.")
}
## # A tibble: 6 × 82
## STOP_ID STOP_FRISK_DATE STOP_FRISK_TIME YEAR2 MONTH2 DAY2 STOP_WAS_INITIATED
## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 1 2023-01-01 00:44:00 2023 January Sund… Based on Radio Run
## 2 2 2023-01-01 00:49:00 2023 January Sund… Based on Self Ini…
## 3 3 2023-01-01 05:31:00 2023 January Sund… Based on Radio Run
## 4 4 2023-01-01 04:59:00 2023 January Sund… Based on Self Ini…
## 5 5 2023-01-01 05:21:00 2023 January Sund… Based on Self Ini…
## 6 6 2023-01-01 18:00:00 2023 January Sund… Based on Radio Run
## # ℹ 75 more variables: RECORD_STATUS_CODE <chr>, ISSUING_OFFICER_RANK <chr>,
## # ISSUING_OFFICER_COMMAND_CODE <chr>, SUPERVISING_OFFICER_RANK <chr>,
## # SUPERVISING_OFFICER_COMMAND_CODE <chr>,
## # SUPERVISING_ACTION_CORRESPONDING_ACTIVITY_LOG_ENTRY_REVIEWED <chr>,
## # LOCATION_IN_OUT_CODE <chr>, JURISDICTION_CODE <chr>,
## # JURISDICTION_DESCRIPTION <chr>, OBSERVED_DURATION_MINUTES <dbl>,
## # SUSPECTED_CRIME_DESCRIPTION <chr>, STOP_DURATION_MINUTES <dbl>, …
# check original dimensions
dim(sqf_data)
## [1] 16971 82
# view head
head(sqf_data)
## # A tibble: 6 × 82
## STOP_ID STOP_FRISK_DATE STOP_FRISK_TIME YEAR2 MONTH2 DAY2 STOP_WAS_INITIATED
## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 1 2023-01-01 00:44:00 2023 January Sund… Based on Radio Run
## 2 2 2023-01-01 00:49:00 2023 January Sund… Based on Self Ini…
## 3 3 2023-01-01 05:31:00 2023 January Sund… Based on Radio Run
## 4 4 2023-01-01 04:59:00 2023 January Sund… Based on Self Ini…
## 5 5 2023-01-01 05:21:00 2023 January Sund… Based on Self Ini…
## 6 6 2023-01-01 18:00:00 2023 January Sund… Based on Radio Run
## # ℹ 75 more variables: RECORD_STATUS_CODE <chr>, ISSUING_OFFICER_RANK <chr>,
## # ISSUING_OFFICER_COMMAND_CODE <chr>, SUPERVISING_OFFICER_RANK <chr>,
## # SUPERVISING_OFFICER_COMMAND_CODE <chr>,
## # SUPERVISING_ACTION_CORRESPONDING_ACTIVITY_LOG_ENTRY_REVIEWED <chr>,
## # LOCATION_IN_OUT_CODE <chr>, JURISDICTION_CODE <chr>,
## # JURISDICTION_DESCRIPTION <chr>, OBSERVED_DURATION_MINUTES <dbl>,
## # SUSPECTED_CRIME_DESCRIPTION <chr>, STOP_DURATION_MINUTES <dbl>, …
First, we change column names from strictly upper case to strictly lower case, because it’s cuter.
colnames(sqf_data) <- tolower(colnames(sqf_data))
# check
colnames(sqf_data)[1:3]
## [1] "stop_id" "stop_frisk_date" "stop_frisk_time"
Next, we drop all columns which cannot be used for our prediction
question, as they are realized after the outcome of
interest, namely arrested_flag, or are irrelevant for other
reasons. We drop spatial features other than x and y coordinates of the
stop, as this is the most granular spatial information and we use this
to map each stop to census tracts to obtain demographic information
corresponding to the location of the stop.These dropped spatial features
also have high cardinality, which would add many dummies to our model,
undermining computational efficiency.
sqf_data <- sqf_data %>%
select(- c("stop_frisk_date", "record_status_code", "supervising_action_corresponding_activity_log_entry_reviewed", "stop_location_sector_code", "stop_location_apartment", "stop_location_full_address", "stop_location_patrol_boro_name", "stop_location_street_name", "suspect_other_description", "observed_duration_minutes", "stop_duration_minutes", "summons_issued_flag", "supervising_officer_command_code", "issuing_officer_command_code", "stop_location_precinct"))
# check new dim
dim(sqf_data)
## [1] 16971 67
First, we note that there is only 1 column with any instance of an
NA value.
na_cols <- colMeans(is.na(sqf_data)) * 100
na_cols[na_cols > 0]
## demeanor_of_person_stopped
## 15.01385
The process generating the missingness of
demeanor_of_person_stopped is unclear. Imputation of this
would be difficult, so we drop this column.
# drop
sqf_data <- sqf_data %>%
select(-("demeanor_of_person_stopped"))
# check new dim
dim(sqf_data)
## [1] 16971 66
Additionally, there are many observations in the data with values ==
(null) across different columns.
# get % of nulls, in columns with at least one null
null_cols <- (colMeans(sqf_data == "(null)") * 100)[colMeans(sqf_data == "(null)") * 100 > 0]
# make df for plot
null_cols_df <- data.frame(Feature = names(null_cols), Percentage = null_cols)
dim(null_cols_df)
## [1] 48 2
# order for plot
null_cols_df$Feature <- factor(null_cols_df$Feature,
levels = null_cols_df$Feature[order(null_cols_df$Percentage, decreasing = FALSE)])
# plot
ggplot(null_cols_df, aes(x = Feature, y = Percentage)) +
geom_bar(stat = "identity", fill = "lightblue", color = "darkblue") +
labs(title = "Percentage of (null) Values per Column",
x = "Columns",
y = "Percentage of (null) Values") +
coord_flip() + # Flip coordinates
theme_minimal()
Note, however, that not all of these (null) observations
are equivalent:
(null) values, (null) means the data are
genuinely effectively NA, as there are
instances of both “Y” and “N” (for binary variable for example),
alongside (null).sqf_data %>%
group_by(ask_for_consent_flg) %>%
summarise(N = n()) %>%
kable()
| ask_for_consent_flg | N |
|---|---|
| (null) | 779 |
| N | 14187 |
| Y | 2005 |
null values are actually
used as “N”.print(unique(sqf_data$firearm_flag))
## [1] "(null)" "Y"
sqf_data %>%
group_by(weapon_found_flag, firearm_flag) %>%
summarise(N = n()) %>%
kable()
## `summarise()` has grouped output by 'weapon_found_flag'. You can override using
## the `.groups` argument.
| weapon_found_flag | firearm_flag | N |
|---|---|---|
| N | (null) | 14310 |
| N | Y | 28 |
| Y | (null) | 1495 |
| Y | Y | 1138 |
Note here that even though a firearm_flag has a
"Y" entry and weapon_found_flag has a
"N" entry, this is not necessarily incorrect, as the
officer can have identified a firearm without having to carry out a
frisk nor a `search.
We deal with these cases of (null) separately:
(null) with
"N" values# initialize empty vector
null_2 <- c()
# loop through columns
for (col in names(sqf_data)) {
# get unique values of the col
unique_values <- unique(sqf_data[[col]])
# if the unique values are exactly "Y" and "(null)"
if (all(unique_values %in% c("Y", "(null)")) && length(unique_values) == 2) {
null_2 <- c(null_2, col) # add column name to null_2
}
}
# check n of type 2 nulls
length(null_2)
## [1] 27
# pre-clean check
print(unique(sqf_data$firearm_flag))
## [1] "(null)" "Y"
# replace these nulls with Ns
sqf_data <- sqf_data %>%
mutate(across(all_of(null_2), ~ ifelse(. == "(null)", "N", .)))
# post-clean check
print(unique(sqf_data$firearm_flag))
## [1] "N" "Y"
NA values:# initialize empty vector
null_1 <- c()
# loop through columns
for (col in names(sqf_data)) {
# for columns not in null_2
if (!(col %in% null_2)) {
# if "(null)" is present in the column
if ("(null)" %in% sqf_data[[col]]) {
null_1 <- c(null_1, col) # add column name to the vector
}
}
}
# check length
length(null_1)
## [1] 21
# pre-clean check
print(unique(sqf_data$ask_for_consent_flg))
## [1] "N" "Y" "(null)"
# replace these with NAs
sqf_data <- sqf_data %>%
mutate(across(all_of(null_1), ~ ifelse(. == "(null)", NA, .)))
# post-clean check
print(unique(sqf_data$ask_for_consent_flg))
## [1] "N" "Y" NA
Now, we percentage of actual missing values, correctly identified by
"NA":
# get % of NAs, in columns with at least one NA
na_cols <- (colMeans(is.na(sqf_data)) * 100)[colMeans(is.na(sqf_data)) * 100 > 0]
# make df for plot
na_cols_df <- data.frame(Feature = names(na_cols), Percentage = na_cols)
# order for plot
na_cols_df$Feature <- factor(na_cols_df$Feature,
levels = na_cols_df$Feature[order(na_cols_df$Percentage, decreasing = FALSE)])
# plot
ggplot(na_cols_df, aes(x = Feature, y = Percentage)) +
geom_bar(stat = "identity", fill = "#F8566D", color = "black") +
labs(title = "Percentage of NA Values per Column",
x = "Columns",
y = "Percentage of NA Values") +
coord_flip() + # Flip coordinates
theme_minimal()
Given the above, we
sqf_data <- sqf_data %>%
select(-all_of(names(na_cols[na_cols > 25])))
dim(sqf_data)
## [1] 16971 57
sqf_data <- sqf_data %>%
filter(!if_any(everything(), is.na))
dim(sqf_data)
## [1] 12095 57
We change the coding of binary variables from "Y" and
"N":
# pre check
print(unique(sqf_data$frisked_flag))
## [1] "Y" "N"
# clean Ys and Ns and set as numeric
sqf_data <- sqf_data %>%
mutate(across(
where(~ all(. %in% c("Y", "N"))),
~ as.numeric(ifelse(. == "Y", 1, ifelse(. == "N", 0, NA))) #
))
# post check
print(unique(sqf_data$frisked_flag))
## [1] 1 0
We also bin the time of the stop from
stop_frisk_time:
sqf_data <- sqf_data %>%
mutate(
time_of_day = case_when(
str_extract(stop_frisk_time, "^\\d{2}") %in% c("06", "07", "08", "09", "10", "11") ~ "Morning",
str_extract(stop_frisk_time, "^\\d{2}") %in% c("12", "13", "14", "15", "16", "17") ~ "Afternoon",
str_extract(stop_frisk_time, "^\\d{2}") %in% c("18", "19", "20", "21", "22", "23") ~ "Evening",
str_extract(stop_frisk_time, "^\\d{2}") %in% c("00", "01", "02", "03", "04", "05") ~ "Night",
TRUE ~ NA_character_ # Handle unexpected or missing values
),
time_of_day = factor(time_of_day, levels = c("Morning", "Afternoon", "Evening", "Night")) # Ensure factor order
)
# Verify the result
print(table(sqf_data$time_of_day))
##
## Morning Afternoon Evening Night
## 909 2753 5626 2807
# drop stop frisk time as we will just use time_of_day
sqf_data <- sqf_data %>%
select(-"stop_frisk_time")
Note - do we want to keep stop hour as well? or just use time of day? -Lucas Answer: I would just use time of day but I would extend shift night along one hour and shorten morning/lengthen evening (or make them uniform blocks of 6 hours)
We also convert other relevant variables to factors as needed:
# convert character columns to factors, except for stop location x and y
sqf_data <- sqf_data %>%
mutate(across(
.cols = where(is.character) & !c("stop_location_x", "stop_location_y"),
.fns = as.factor
))
Next, we will make suspect_height,
suspect_weight, and suspect_reported_age
numeric keeping in mind the factor levels to ensure R converts them
correctly. Then, we will address outliers in the data.
library(dplyr)
# Function to safely convert factor to numeric, handling non-numeric entries
convert_to_numeric <- function(x) {
# Convert to character to avoid factor levels issues
x <- as.character(x)
# Replace non-numeric values with NA (e.g., "unknown", "760", "7.6")
x <- gsub("[^0-9.]", "", x)
# Convert to numeric
as.numeric(x)
}
# Apply the function to the relevant columns
sqf_data <- sqf_data %>%
mutate(
suspect_reported_age = convert_to_numeric(suspect_reported_age),
suspect_height = convert_to_numeric(suspect_height),
suspect_weight = convert_to_numeric(suspect_weight)
)
#Check to make sure successful
summary(sqf_data$suspect_reported_age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 19.00 25.00 27.77 34.00 118.00
summary(sqf_data$suspect_height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.500 5.400 5.700 5.649 5.900 7.600
summary(sqf_data$suspect_weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 150.0 160.0 166.3 180.0 760.0
Let’s plot the density of age to see the distribution. Since we have some outliers, we are going to drop any observations where age is below 10 and above 85.
# Compute density for reported_age
reported_age_density <- density(sqf_data$suspect_reported_age, na.rm = TRUE)
# Plot the density
plot(reported_age_density,
main = "Density of Reported Age with Outliers Highlighted",
xlab = "Reported Age",
ylab = "Density",
col = "black",
lwd = 2)
grid()
# Identify and highlight outliers (ages < 10 and > 85)
outliers <- sqf_data$suspect_reported_age[sqf_data$suspect_reported_age < 10 | sqf_data$suspect_reported_age > 85]
# Add vertical lines to mark the outlier boundaries
abline(v = c(10, 85), col = "deeppink", lty = 2, lwd = 2)
# Highlight the outlier data points on the plot
points(outliers, rep(0, length(outliers)), col = "deeppink", pch = 19)
#We will drop outlier observations where age is above 85 and below 10.
sqf_data <- sqf_data[sqf_data$suspect_reported_age >= 10 & sqf_data$suspect_reported_age <= 85, ]
We plot the density of height to see the distribution and identify outliers. Height appears to be measured in feet and we are going to drop observations below 4 ft and above 7 ft.
# Compute density for suspect_height
height_density <- density(sqf_data$suspect_height, na.rm = TRUE)
# Plot the density
plot(height_density,
main = "Density of Height with Outliers Highlighted",
xlab = "Height",
ylab = "Density",
col = "black",
lwd = 2)
grid()
# Identify and highlight outliers (e.g., heights below 4 feet and above 7 feet)
outliers <- sqf_data$suspect_height[sqf_data$suspect_height < 4 | sqf_data$suspect_height > 7]
# Add vertical lines to mark the outlier boundaries
abline(v = c(4, 7), col = "darkorchid2", lty = 2, lwd = 2)
# Highlight the outlier data points on the plot
points(outliers, rep(0, length(outliers)), col = "darkorchid2", pch = 19)
#We will drop outlier observations where height is above 7 ft and below 4 ft.
sqf_data <- sqf_data[sqf_data$suspect_height >= 4 & sqf_data$suspect_height <= 7, ]
We plot the density of weight to identify outliers. The weight variable appears to be measured in pounds and it appears that some observations could have data entry errors (they are nonsensical). We will drop outliers above 300 lbs and below 90 lbs.
# Compute density for suspect_weight
weight_density <- density(sqf_data$suspect_weight, na.rm = TRUE)
# Plot the density
plot(weight_density,
main = "Density of Suspect Weight with Outliers Highlighted",
xlab = "Weight",
ylab = "Density",
col = "black",
lwd = 2)
grid()
# Identify and highlight outliers (e.g., weights below 90 lbs and above 300 lbs)
outliers <- sqf_data$suspect_weight[sqf_data$suspect_weight < 90 | sqf_data$suspect_weight > 300]
# Add vertical lines to mark the outlier boundaries
abline(v = c(90, 300), col = "darkturquoise", lty = 2, lwd = 2)
# Highlight the outlier data points on the plot
points(outliers, rep(0, length(outliers)), col = "darkturquoise", pch = 19)
#We will drop outlier observations where height is above 300 lbs. and below 90 lbs.
sqf_data <- sqf_data[sqf_data$suspect_weight >= 90 & sqf_data$suspect_weight <= 300, ]
Finally, we will standardize our numerical variables as they use different scales.
# Standardize the numeric variables
sqf_data <- sqf_data %>%
mutate(
suspect_reported_age = scale(suspect_reported_age),
suspect_height = scale(suspect_height),
suspect_weight = scale(suspect_weight)
)
# Check if the standardization worked by summarizing the variables
summary(sqf_data$suspect_reported_age)
## V1
## Min. :-1.5034
## 1st Qu.:-0.7427
## Median :-0.2355
## Mean : 0.0000
## 3rd Qu.: 0.5252
## Max. : 4.3288
summary(sqf_data$suspect_height)
## V1
## Min. :-4.4452
## 1st Qu.:-0.6744
## Median : 0.1337
## Mean : 0.0000
## 3rd Qu.: 0.6724
## Max. : 3.6352
summary(sqf_data$suspect_weight)
## V1
## Min. :-2.4893
## 1st Qu.:-0.5337
## Median :-0.2078
## Mean : 0.0000
## 3rd Qu.: 0.4440
## Max. : 4.3551
# check dim again
dim(sqf_data)
## [1] 12044 57
# tabulation of the dependent variable
sqf_data %>%
group_by(suspect_arrested_flag) %>%
summarise(N = n(),
Pc = N / nrow(sqf_data) * 100) %>%
arrange(desc(N)) %>%
kable(booktabs = TRUE, col.names = c("Suspect Arrested", "N Stops", "% Total Stops"), align = "l")
| Suspect Arrested | N Stops | % Total Stops |
|---|---|---|
| 0 | 8270 | 68.6649 |
| 1 | 3774 | 31.3351 |
# looking at the distribution of sex
ggplot(sqf_data, aes(x = suspect_sex, fill = suspect_sex)) +
geom_bar() +
labs(
title = "Distribution of Suspect Sex",
x = "Sex",
y = "Count"
) +
theme_minimal() +
scale_fill_manual(
values = c("MALE" = "lightblue", "FEMALE" = "pink")
) +
theme(legend.position = "none")
# sex by arrest status
ggplot(sqf_data, aes(x = suspect_sex, fill = factor(suspect_arrested_flag))) +
geom_bar(position = "fill") +
labs(title = "Distribution of Suspect Sex",
x = "Sex",
y = "Count") +
theme_minimal() +
scale_y_continuous(labels = scales::percent) +
scale_fill_brewer(type = "qual", palette = "Pastel2", name = "Suspect Arrested")
# empirical cdf of age by sex and arrest status
ggplot(sqf_data, aes(x = suspect_reported_age, color = factor(suspect_arrested_flag))) +
stat_ecdf(geom = "step") +
facet_wrap(~ suspect_sex, ncol = 2) +
scale_color_manual(values = c("0" = "red", "1" = "darkgreen"),
labels = c("Not Arrested", "Arrested"),
name = "Arrest Outcome") +
labs(x = "Suspect Reported Age", y = "ECDF", title = "Empirical CDF of Suspect Reported Age, By Sex and Arrest Status") +
theme_minimal()
# distribution of race
ggplot(sqf_data, aes(x = suspect_race_description, fill = factor(suspect_race_description))) +
geom_bar() +
labs(
title = "Distribution of Suspect Race",
x = "Race",
y = "Count"
) +
theme_minimal() +
scale_fill_brewer(type = "qual", palette = "Pastel1", name = "Suspect Arrested") +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none"
)
# arrests by race, unstacked
ggplot(data = sqf_data, aes(x = fct_rev(fct_infreq(suspect_race_description)), fill = factor(suspect_arrested_flag))) +
geom_bar() +
coord_flip() +
theme_minimal() +
xlab("Suspect Race") +
ylab("N Observations") +
scale_fill_brewer(type = "qual", palette = "Pastel1", name = "Suspect Arrested") +
labs(title = "Suspect Arrested, By Race")
# arrests by race, stacked
ggplot(data = sqf_data, aes(x = fct_rev(fct_infreq(suspect_race_description)), fill = factor(suspect_arrested_flag))) +
geom_bar(position = "fill") +
coord_flip() +
scale_y_continuous(labels = scales::percent) +
theme_minimal() +
xlab("Suspect Race") +
ylab("% Observations") +
scale_fill_brewer(type = "qual", palette = "Pastel1", name = "Suspect Arrested") +
labs(title = "Suspect Arrested, By Race")
# arrests by suspected crime description, unstacked
ggplot(data = sqf_data, aes(x = fct_rev(fct_infreq(suspected_crime_description)), fill = factor(suspect_arrested_flag))) +
geom_bar() +
coord_flip() +
theme_minimal() +
xlab("Suspected Crime") +
ylab("N Observations") +
scale_fill_brewer(type = "qual", palette = "Pastel1", name = "Suspect Arrested") +
labs(title = "Suspect Arrested, By Suspected Crime")
# arrests by suspected crime description, stacked
ggplot(data = sqf_data, aes(x = suspected_crime_description, fill = factor(suspect_arrested_flag))) +
geom_bar(position = "fill") +
coord_flip() +
scale_y_continuous(labels = scales::percent) +
theme_minimal() +
xlab("Suspected Crime") +
ylab("% Observations") +
scale_fill_brewer(type = "qual", palette = "Pastel1", name = "Suspect Arrested") +
labs(title = "Suspect Arrested, By Suspected Crime")
# to be done:
# time of stop
# cop side stuff
# any other variables of importance
# cormat/heatmaps/PCA
# look for outliers here!
# generate features as needed
## Spatial Data Visualisation
We first clean the data for spatial mapping using the `sf` and `nycgeo` packages.
We use this to obtain information about the stops at the census tract level, due to its granularity and the availability of population statistics at this level. **insert why relevant for prediction**
``` r
# drop 7 observations which have incorrect spatial info
sqf_data <- sqf_data %>%
filter(stop_location_x > 0)
dim(sqf_data)
## [1] 12037 57
# make spatial object for mapping
sqf_data_sf <- st_as_sf(sqf_data,
coords = c("stop_location_x", "stop_location_y"),
crs = 2263) # crs for New York (EPSG:2263)
# load in nta-level shapefile
remotes::install_github("mfherman/nycgeo")
## Skipping install of 'nycgeo' from a github remote, the SHA1 (4fee55c1) has not changed since last install.
## Use `force = TRUE` to force installation
library(nycgeo)
nyc_tract_shp <- nycgeo::nyc_boundaries(geography = "tract", add_acs_data = TRUE)
## old-style crs object detected; please recreate object with a recent sf::st_crs()
# check crs
st_crs(nyc_tract_shp)$epsg
## [1] 2263
# plot data onto shapefile by arrest status
ggplot() +
geom_sf(data = nyc_tract_shp, fill = "lightblue", color = "black", size = 0.3) +
geom_sf(data = sqf_data_sf, aes(color = as.factor(suspect_arrested_flag)), size = 0.7) +
scale_color_manual(values = c("red", "seagreen"),
labels = c("Arrested", "Not Arrested")) +
theme_minimal() +
labs(title = "NYC Police Stops by Arrest Status") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
legend.title = element_blank())
Perhaps it is more informative to view the percentage of stops ending in arrest at the tract level:
# join datasets to assign each stop to a tract
sqf_data_sf <- st_join(sqf_data_sf, nyc_tract_shp)
dim(sqf_data_sf)
## [1] 12037 102
# aggregate to tract level
sqf_data_sf_tract_level <- sqf_data_sf %>%
filter(!is.na(geoid)) %>%
group_by(geoid) %>%
summarize(pc_arrest = (sum(suspect_arrested_flag) / n()) * 100)
# join with shp for mapping
sqf_data_sf_tract_level <- nyc_tract_shp %>%
st_join(sqf_data_sf_tract_level, by = "geoid")
ggplot() +
geom_sf(data = sqf_data_sf_tract_level, aes(fill = pc_arrest), color = "black", size = 0.3) +
scale_fill_viridis_c(
name = "% Stops Ending in Arrest",
option = "inferno",
na.value = "white"
) +
theme_void() +
labs(title = "Percentage of Stops Ending in Arrest by NYC Census Tract")
Additionally, the nycgeo package allows us to link with
neighbour tabulation area level American Community Survey (2013-2017)
population statistics.
insert justification for why these - demographic stats of location of the stop - might be predictors of arrest
We visualize those here:
# non hispanic black
ggplot(nyc_tract_shp) +
geom_sf(aes(fill = pop_black_est)) +
scale_fill_viridis_c(
name = "Non-Hispanic Black Population",
option = "inferno"
) +
theme_void() +
labs(title = "Non-Hispanic Black Population by Census Tract, ACS 2013-2017")
# hispanic any
ggplot(nyc_tract_shp) +
geom_sf(aes(fill = pop_hisp_est)) +
scale_fill_viridis_c(
name = "Hispanic Any Race Population",
option = "inferno"
) +
theme_void() +
labs(title = "Hispanic Any Race Population by Census Tract, ACS 2013-2017")
# non
ggplot(nyc_tract_shp) +
geom_sf(aes(fill = pop_asian_est)) +
scale_fill_viridis_c(
name = "Non-hispanic Asian Population",
option = "inferno"
) +
theme_void() +
labs(title = "Non-hispanic Asian Population by Census Tract, ACS 2013-2017")
# pop age 25 years or older with at least bachelors degree
ggplot(nyc_tract_shp) +
geom_sf(aes(fill = pop_ba_above_est)) +
scale_fill_viridis_c(
name = "Population Aged >= 25 with at Least Bachelors Degree",
option = "inferno"
) +
theme_void() +
labs(title = "Population Aged >= 25 with at least a Bachelor's Degree by Census Tract, ACS 2013-2017")
# income below pov
# pop age 25 years or older with at least bachelors degree
ggplot(nyc_tract_shp) +
geom_sf(aes(fill = pop_inpov_est)) +
scale_fill_viridis_c(
name = "Population With Income Below Poverty Line",
option = "inferno"
) +
theme_void() +
labs(title = "Population with Income Below Poverty Line, ACS 2013-2017")
We keep only these predictors from the ACS data as predictors for our analysis, as shown below.
# check current dim
dim(sqf_data)
## [1] 12037 57
sqf_data <- sqf_data %>%
# left join selected spatial features from the sf object into sqf_data
left_join(sqf_data_sf %>% select(stop_id, pop_ba_above_est, pop_inpov_est, pop_asian_est, pop_hisp_est, pop_black_est), by = "stop_id") %>%
# drop x,y coords and geometry as we use census tract for spatial info
select(-c("stop_location_x", "stop_location_y", "geometry")) %>%
# drop obs with missing values in these spatial features
filter(!if_any(everything(), is.na))
## Warning: Using one column matrices in `filter()` was deprecated in dplyr 1.1.0.
## ℹ Please use one dimensional logical vectors instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# check new dim
dim(sqf_data)
## [1] 12036 60
We first run LASSO, Ridge and Elastic Net regressions.
We split our sample as [edit].
We implement cross-validation in our training sample as [edit] . We use the default \(k=10\) folds for computational efficiency.
As we are operating in a classification setting, we choose the value
of the parameter lambda that minimizes the misclassification
error, and implement this by selecting
type.measure = "class" in our cv.glmnet regressions.
The cost ratio is: \[C = \frac{L(1,0)}{L(0,1)}\].
This measures how costly false negatives are relative to false positives. The optimal decision is \[\hat{y_i} = 1 \iff p_i > {1 \over 1+C}\]
We choose \(C=1\) and weight both types of misclassification equally, and so classify according to the most likely class i.e \[p_i > 0.5 \longrightarrow \hat{y} = 1\].
For each model, we also estimate using a matrix of predictors which excludes predictors related to search, physical force and frisk as [edit].
We evaluate all of our models based on their predictive performance:
We also look at the variable importance between models.
Set training and test data (70/30 split), and create a subset with fewer variables to test and train with.
# set seed for reproducibility
set.seed(1)
# set y and two predictor matrices (unsplit)
y <- sqf_data$suspect_arrested_flag
X <- model.matrix(~ . - suspect_arrested_flag - stop_id, data = sqf_data) #remove unique id
# set x subset, removing anything that is/might be realized during the stop
X_subset <- X[, !grepl("^(search|physical_force|.*eye_color)", colnames(X)) &
!colnames(X) %in% c("frisked_flag", "firearm_flag", "knife_cutter_flag",
"other_weapon_flag", "weapon_found_flag", "other_contraband_flag")]
colnames(X_subset)
## [1] "(Intercept)"
## [2] "year2"
## [3] "month2August"
## [4] "month2December"
## [5] "month2February"
## [6] "month2January"
## [7] "month2July"
## [8] "month2June"
## [9] "month2March"
## [10] "month2May"
## [11] "month2November"
## [12] "month2October"
## [13] "month2September"
## [14] "day2Monday"
## [15] "day2Saturday"
## [16] "day2Sunday"
## [17] "day2Thursday"
## [18] "day2Tuesday"
## [19] "day2Wednesday"
## [20] "stop_was_initiatedBased on Radio Run"
## [21] "stop_was_initiatedBased on Self Initiated"
## [22] "issuing_officer_rankDI"
## [23] "issuing_officer_rankDT2"
## [24] "issuing_officer_rankDT3"
## [25] "issuing_officer_rankDTS"
## [26] "issuing_officer_rankINS"
## [27] "issuing_officer_rankLSA"
## [28] "issuing_officer_rankLT"
## [29] "issuing_officer_rankPO"
## [30] "issuing_officer_rankPOF"
## [31] "issuing_officer_rankPOM"
## [32] "issuing_officer_rankSDS"
## [33] "issuing_officer_rankSGT"
## [34] "issuing_officer_rankSSA"
## [35] "supervising_officer_rankDI"
## [36] "supervising_officer_rankDT3"
## [37] "supervising_officer_rankDTS"
## [38] "supervising_officer_rankINS"
## [39] "supervising_officer_rankLCD"
## [40] "supervising_officer_rankLSA"
## [41] "supervising_officer_rankLT"
## [42] "supervising_officer_rankPO"
## [43] "supervising_officer_rankPOF"
## [44] "supervising_officer_rankPOM"
## [45] "supervising_officer_rankSDS"
## [46] "supervising_officer_rankSGT"
## [47] "supervising_officer_rankSSA"
## [48] "suspected_crime_descriptionAUTO STRIPPIG"
## [49] "suspected_crime_descriptionBURGLARY"
## [50] "suspected_crime_descriptionCPSP"
## [51] "suspected_crime_descriptionCPW"
## [52] "suspected_crime_descriptionCRIMINAL MISCHIEF"
## [53] "suspected_crime_descriptionCRIMINAL POSSESSION OF CONTROLLED SUBSTANCE"
## [54] "suspected_crime_descriptionCRIMINAL POSSESSION OF FORGED INSTRUMENT"
## [55] "suspected_crime_descriptionCRIMINAL SALE OF CONTROLLED SUBSTANCE"
## [56] "suspected_crime_descriptionCRIMINAL SALE OF MARIHUANA"
## [57] "suspected_crime_descriptionCRIMINAL TRESPASS"
## [58] "suspected_crime_descriptionFORCIBLE TOUCHING"
## [59] "suspected_crime_descriptionGRAND LARCENY"
## [60] "suspected_crime_descriptionGRAND LARCENY AUTO"
## [61] "suspected_crime_descriptionMAKING GRAFFITI"
## [62] "suspected_crime_descriptionMENACING"
## [63] "suspected_crime_descriptionMURDER"
## [64] "suspected_crime_descriptionOTHER"
## [65] "suspected_crime_descriptionPETIT LARCENY"
## [66] "suspected_crime_descriptionRAPE"
## [67] "suspected_crime_descriptionRECKLESS ENDANGERMENT"
## [68] "suspected_crime_descriptionROBBERY"
## [69] "suspected_crime_descriptionTERRORISM"
## [70] "suspected_crime_descriptionTHEFT OF SERVICES"
## [71] "suspected_crime_descriptionUNAUTHORIZED USE OF A VEHICLE"
## [72] "officer_explained_stop_flag"
## [73] "other_person_stopped_flag"
## [74] "officer_in_uniform_flag"
## [75] "ask_for_consent_flg"
## [76] "consent_given_flg"
## [77] "backround_circumstances_violent_crime_flag"
## [78] "backround_circumstances_suspect_known_to_carry_weapon_flag"
## [79] "suspects_actions_casing_flag"
## [80] "suspects_actions_concealed_possession_weapon_flag"
## [81] "suspects_actions_decription_flag"
## [82] "suspects_actions_drug_transactions_flag"
## [83] "suspects_actions_identify_crime_pattern_flag"
## [84] "suspects_actions_lookout_flag"
## [85] "suspects_actions_other_flag"
## [86] "suspects_actions_proximity_to_scene_flag"
## [87] "suspect_reported_age"
## [88] "suspect_sexMALE"
## [89] "suspect_race_descriptionASIAN / PACIFIC ISLANDER"
## [90] "suspect_race_descriptionBLACK"
## [91] "suspect_race_descriptionBLACK HISPANIC"
## [92] "suspect_race_descriptionMIDDLE EASTERN/SOUTHWEST ASIAN"
## [93] "suspect_race_descriptionWHITE"
## [94] "suspect_race_descriptionWHITE HISPANIC"
## [95] "suspect_height"
## [96] "suspect_weight"
## [97] "suspect_body_build_typeMED"
## [98] "suspect_body_build_typeTHN"
## [99] "suspect_body_build_typeU"
## [100] "suspect_body_build_typeXXX"
## [101] "suspect_hair_colorBLK"
## [102] "suspect_hair_colorBLN"
## [103] "suspect_hair_colorBRO"
## [104] "suspect_hair_colorGRN"
## [105] "suspect_hair_colorGRY"
## [106] "suspect_hair_colorORG"
## [107] "suspect_hair_colorPLE"
## [108] "suspect_hair_colorPNK"
## [109] "suspect_hair_colorRED"
## [110] "suspect_hair_colorSDY"
## [111] "suspect_hair_colorWHI"
## [112] "suspect_hair_colorXXX"
## [113] "suspect_hair_colorZZZ"
## [114] "stop_location_boro_nameBROOKLYN"
## [115] "stop_location_boro_nameMANHATTAN"
## [116] "stop_location_boro_nameQUEENS"
## [117] "stop_location_boro_nameSTATEN ISLAND"
## [118] "time_of_dayAfternoon"
## [119] "time_of_dayEvening"
## [120] "time_of_dayNight"
## [121] "pop_ba_above_est"
## [122] "pop_inpov_est"
## [123] "pop_asian_est"
## [124] "pop_hisp_est"
## [125] "pop_black_est"
# perform train-test split
train_index <- createDataPartition(y, p = 0.7, list = FALSE)
y_train <- y[train_index]
y_test <- y[-train_index]
X_train <- X[train_index,]
X_test <- X[-train_index, ]
X_train_subset <- X_subset[train_index, ]
X_test_subset <- X_subset[-train_index, ]
# print lengths and dimensions
cat("Length of y_train:", length(y_train), "\n")
## Length of y_train: 8426
cat("Length of y_test:", length(y_test), "\n")
## Length of y_test: 3610
cat("Dimensions of X_train:", dim(X_train), "\n")
## Dimensions of X_train: 8426 156
cat("Dimensions of X_test:", dim(X_test), "\n")
## Dimensions of X_test: 3610 156
cat("Dimensions of X_train_subset:", dim(X_train_subset), "\n")
## Dimensions of X_train_subset: 8426 125
cat("Dimensions of X_test_subset:", dim(X_test_subset), "\n")
## Dimensions of X_test_subset: 3610 125
# check balance of y
cat("Balance of y_train:\n")
## Balance of y_train:
print(table(y_train))
## y_train
## 0 1
## 5812 2614
cat("Balance of y_test:\n")
## Balance of y_test:
print(table(y_test))
## y_test
## 0 1
## 2454 1156
##Logistic Regression `
# For logistic regression we need to convert X_train and X_test to a data.frame
X_train_df <- as.data.frame(X_train)
X_train_subset_df <- as.data.frame(X_train_subset)
X_test_df <- as.data.frame(X_test)
X_test_subset_df <- as.data.frame(X_test_subset)
#Logistic Regression for all variables and the subset
logit_all <- glm(y_train ~ ., data = X_train_df, family = binomial(logit))
summary(logit_all)
##
## Call:
## glm(formula = y_train ~ ., family = binomial(logit), data = X_train_df)
##
## Coefficients: (3 not defined because of singularities)
## Estimate
## (Intercept) 9.303e-01
## `(Intercept)` NA
## year2 NA
## month2August 1.599e-01
## month2December 1.662e-02
## month2February -1.922e-02
## month2January -7.889e-02
## month2July 3.976e-03
## month2June -1.793e-01
## month2March -1.209e-02
## month2May -1.057e-03
## month2November -3.390e-02
## month2October 2.732e-01
## month2September 6.641e-02
## day2Monday -3.250e-01
## day2Saturday -9.166e-02
## day2Sunday -2.915e-01
## day2Thursday -4.449e-01
## day2Tuesday -2.716e-01
## day2Wednesday -3.086e-01
## `stop_was_initiatedBased on Radio Run` 2.168e-02
## `stop_was_initiatedBased on Self Initiated` -5.128e-01
## issuing_officer_rankDI -2.420e+00
## issuing_officer_rankDT2 -1.497e+01
## issuing_officer_rankDT3 -1.052e+00
## issuing_officer_rankDTS -2.182e+00
## issuing_officer_rankINS -1.308e+01
## issuing_officer_rankLSA -2.491e+00
## issuing_officer_rankLT -1.682e+00
## issuing_officer_rankPO -2.009e+00
## issuing_officer_rankPOF -1.867e+00
## issuing_officer_rankPOM -1.888e+00
## issuing_officer_rankSDS -1.054e+01
## issuing_officer_rankSGT -1.941e+00
## issuing_officer_rankSSA -1.946e+00
## supervising_officer_rankDI 6.733e-01
## supervising_officer_rankDT3 -1.585e+01
## supervising_officer_rankDTS NA
## supervising_officer_rankINS 7.066e-01
## supervising_officer_rankLCD -1.136e+01
## supervising_officer_rankLSA -9.157e-01
## supervising_officer_rankLT 4.895e-01
## supervising_officer_rankPO 2.015e+00
## supervising_officer_rankPOF -1.426e+01
## supervising_officer_rankPOM -8.047e-02
## supervising_officer_rankSDS -5.919e-01
## supervising_officer_rankSGT 3.136e-01
## supervising_officer_rankSSA 3.689e-01
## `suspected_crime_descriptionAUTO STRIPPIG` 7.043e-01
## suspected_crime_descriptionBURGLARY -3.861e-02
## suspected_crime_descriptionCPSP 5.472e-01
## suspected_crime_descriptionCPW -1.214e+00
## `suspected_crime_descriptionCRIMINAL MISCHIEF` 5.432e-01
## `suspected_crime_descriptionCRIMINAL POSSESSION OF CONTROLLED SUBSTANCE` -7.078e-01
## `suspected_crime_descriptionCRIMINAL POSSESSION OF FORGED INSTRUMENT` -1.301e+01
## `suspected_crime_descriptionCRIMINAL SALE OF CONTROLLED SUBSTANCE` -1.039e+00
## `suspected_crime_descriptionCRIMINAL SALE OF MARIHUANA` -1.022e+00
## `suspected_crime_descriptionCRIMINAL TRESPASS` 1.053e-01
## `suspected_crime_descriptionFORCIBLE TOUCHING` 2.857e+00
## `suspected_crime_descriptionGRAND LARCENY` 4.969e-01
## `suspected_crime_descriptionGRAND LARCENY AUTO` 3.739e-01
## `suspected_crime_descriptionMAKING GRAFFITI` 1.220e+00
## suspected_crime_descriptionMENACING -3.365e-02
## suspected_crime_descriptionMURDER -6.718e-01
## suspected_crime_descriptionOTHER -3.750e-01
## `suspected_crime_descriptionPETIT LARCENY` 6.294e-01
## suspected_crime_descriptionRAPE -2.473e+00
## `suspected_crime_descriptionRECKLESS ENDANGERMENT` -2.828e-02
## suspected_crime_descriptionROBBERY 2.468e-01
## suspected_crime_descriptionTERRORISM -1.279e+01
## `suspected_crime_descriptionTHEFT OF SERVICES` -1.676e-01
## `suspected_crime_descriptionUNAUTHORIZED USE OF A VEHICLE` 4.478e-01
## officer_explained_stop_flag -7.595e-01
## other_person_stopped_flag -1.212e-01
## officer_in_uniform_flag 4.684e-01
## frisked_flag -1.199e-01
## searched_flag 9.343e-01
## ask_for_consent_flg 1.886e-02
## consent_given_flg -5.914e-01
## other_contraband_flag 2.147e+00
## firearm_flag 2.984e+00
## knife_cutter_flag -8.689e-01
## other_weapon_flag -6.022e-01
## weapon_found_flag 1.722e+00
## physical_force_cew_flag 4.302e-01
## physical_force_draw_point_firearm_flag 9.102e-02
## physical_force_handcuff_suspect_flag 3.894e-01
## physical_force_oc_spray_used_flag 1.587e+01
## physical_force_other_flag -2.724e-01
## physical_force_restraint_used_flag 2.336e-01
## physical_force_verbal_instruction_flag 1.317e-01
## physical_force_weapon_impact_flag 1.355e+01
## backround_circumstances_violent_crime_flag 6.574e-02
## backround_circumstances_suspect_known_to_carry_weapon_flag 4.091e-01
## suspects_actions_casing_flag -1.639e-01
## suspects_actions_concealed_possession_weapon_flag -3.421e-01
## suspects_actions_decription_flag 4.220e-01
## suspects_actions_drug_transactions_flag 2.703e-01
## suspects_actions_identify_crime_pattern_flag -7.379e-01
## suspects_actions_lookout_flag 2.848e-01
## suspects_actions_other_flag -4.667e-03
## suspects_actions_proximity_to_scene_flag 8.030e-02
## search_basis_admission_flag -1.135e-01
## search_basis_consent_flag 5.209e-02
## search_basis_hard_object_flag 2.059e-02
## search_basis_incidental_to_arrest_flag 3.087e+00
## search_basis_other_flag 4.007e-01
## search_basis_outline_flag 3.947e-01
## suspect_reported_age 3.136e-02
## suspect_sexMALE 4.276e-02
## `suspect_race_descriptionASIAN / PACIFIC ISLANDER` -1.260e+00
## suspect_race_descriptionBLACK -1.045e+00
## `suspect_race_descriptionBLACK HISPANIC` -1.177e+00
## `suspect_race_descriptionMIDDLE EASTERN/SOUTHWEST ASIAN` -1.294e+00
## suspect_race_descriptionWHITE -1.521e+00
## `suspect_race_descriptionWHITE HISPANIC` -1.064e+00
## suspect_height 3.867e-03
## suspect_weight -1.052e-01
## suspect_body_build_typeMED -2.964e-01
## suspect_body_build_typeTHN -2.690e-01
## suspect_body_build_typeU -3.737e-01
## suspect_body_build_typeXXX -1.181e+00
## suspect_eye_colorBLU -2.923e-01
## suspect_eye_colorBRO 1.972e-02
## suspect_eye_colorGRN 2.044e-01
## suspect_eye_colorGRY 2.019e-01
## suspect_eye_colorHAZ -4.387e-01
## suspect_eye_colorMAR -1.144e+01
## suspect_eye_colorMUL 1.628e+00
## suspect_eye_colorOTH 2.042e+00
## suspect_eye_colorPNK -9.872e+00
## suspect_eye_colorZZZ -8.609e-02
## suspect_hair_colorBLK -3.853e-01
## suspect_hair_colorBLN -2.480e-01
## suspect_hair_colorBRO -2.064e-01
## suspect_hair_colorGRN -1.423e+01
## suspect_hair_colorGRY -3.097e-01
## suspect_hair_colorORG -2.523e+00
## suspect_hair_colorPLE -1.375e+01
## suspect_hair_colorPNK -9.413e-01
## suspect_hair_colorRED 6.341e-01
## suspect_hair_colorSDY 3.865e-01
## suspect_hair_colorWHI -3.604e-01
## suspect_hair_colorXXX -1.922e+00
## suspect_hair_colorZZZ -2.044e-01
## stop_location_boro_nameBROOKLYN 4.619e-01
## stop_location_boro_nameMANHATTAN 3.177e-01
## stop_location_boro_nameQUEENS 1.814e-01
## `stop_location_boro_nameSTATEN ISLAND` 1.943e-01
## time_of_dayAfternoon 9.688e-02
## time_of_dayEvening -1.624e-01
## time_of_dayNight 2.516e-02
## pop_ba_above_est 7.092e-05
## pop_inpov_est 1.077e-04
## pop_asian_est 3.292e-05
## pop_hisp_est 3.464e-05
## pop_black_est -5.761e-05
## Std. Error
## (Intercept) 1.846e+00
## `(Intercept)` NA
## year2 NA
## month2August 2.022e-01
## month2December 2.337e-01
## month2February 2.053e-01
## month2January 1.997e-01
## month2July 2.021e-01
## month2June 2.127e-01
## month2March 2.003e-01
## month2May 1.970e-01
## month2November 2.240e-01
## month2October 2.022e-01
## month2September 2.204e-01
## day2Monday 1.654e-01
## day2Saturday 1.485e-01
## day2Sunday 1.557e-01
## day2Thursday 1.512e-01
## day2Tuesday 1.474e-01
## day2Wednesday 1.471e-01
## `stop_was_initiatedBased on Radio Run` 1.256e-01
## `stop_was_initiatedBased on Self Initiated` 1.676e-01
## issuing_officer_rankDI 2.165e+00
## issuing_officer_rankDT2 7.001e+02
## issuing_officer_rankDT3 1.474e+00
## issuing_officer_rankDTS 1.439e+00
## issuing_officer_rankINS 1.455e+03
## issuing_officer_rankLSA 2.427e+00
## issuing_officer_rankLT 1.426e+00
## issuing_officer_rankPO 1.416e+00
## issuing_officer_rankPOF 1.421e+00
## issuing_officer_rankPOM 1.413e+00
## issuing_officer_rankSDS 9.154e+02
## issuing_officer_rankSGT 1.440e+00
## issuing_officer_rankSSA 1.707e+00
## supervising_officer_rankDI 1.320e+00
## supervising_officer_rankDT3 8.508e+02
## supervising_officer_rankDTS NA
## supervising_officer_rankINS 1.745e+00
## supervising_officer_rankLCD 9.154e+02
## supervising_officer_rankLSA 1.224e+00
## supervising_officer_rankLT 8.171e-01
## supervising_officer_rankPO 1.299e+00
## supervising_officer_rankPOF 1.455e+03
## supervising_officer_rankPOM 1.350e+00
## supervising_officer_rankSDS 1.446e+00
## supervising_officer_rankSGT 8.192e-01
## supervising_officer_rankSSA 8.712e-01
## `suspected_crime_descriptionAUTO STRIPPIG` 6.407e-01
## suspected_crime_descriptionBURGLARY 2.009e-01
## suspected_crime_descriptionCPSP 4.410e-01
## suspected_crime_descriptionCPW 1.780e-01
## `suspected_crime_descriptionCRIMINAL MISCHIEF` 2.951e-01
## `suspected_crime_descriptionCRIMINAL POSSESSION OF CONTROLLED SUBSTANCE` 8.338e-01
## `suspected_crime_descriptionCRIMINAL POSSESSION OF FORGED INSTRUMENT` 8.208e+02
## `suspected_crime_descriptionCRIMINAL SALE OF CONTROLLED SUBSTANCE` 8.970e-01
## `suspected_crime_descriptionCRIMINAL SALE OF MARIHUANA` 1.556e+00
## `suspected_crime_descriptionCRIMINAL TRESPASS` 3.583e-01
## `suspected_crime_descriptionFORCIBLE TOUCHING` 1.107e+00
## `suspected_crime_descriptionGRAND LARCENY` 2.234e-01
## `suspected_crime_descriptionGRAND LARCENY AUTO` 2.299e-01
## `suspected_crime_descriptionMAKING GRAFFITI` 6.031e-01
## suspected_crime_descriptionMENACING 2.624e-01
## suspected_crime_descriptionMURDER 7.380e-01
## suspected_crime_descriptionOTHER 2.465e-01
## `suspected_crime_descriptionPETIT LARCENY` 1.889e-01
## suspected_crime_descriptionRAPE 1.350e+00
## `suspected_crime_descriptionRECKLESS ENDANGERMENT` 3.916e-01
## suspected_crime_descriptionROBBERY 1.665e-01
## suspected_crime_descriptionTERRORISM 8.196e+02
## `suspected_crime_descriptionTHEFT OF SERVICES` 1.079e+00
## `suspected_crime_descriptionUNAUTHORIZED USE OF A VEHICLE` 5.600e-01
## officer_explained_stop_flag 1.671e-01
## other_person_stopped_flag 9.637e-02
## officer_in_uniform_flag 3.452e-01
## frisked_flag 1.261e-01
## searched_flag 1.903e-01
## ask_for_consent_flg 1.930e-01
## consent_given_flg 1.938e-01
## other_contraband_flag 2.016e-01
## firearm_flag 4.301e-01
## knife_cutter_flag 4.200e-01
## other_weapon_flag 4.519e-01
## weapon_found_flag 4.295e-01
## physical_force_cew_flag 5.197e-01
## physical_force_draw_point_firearm_flag 2.094e-01
## physical_force_handcuff_suspect_flag 1.128e-01
## physical_force_oc_spray_used_flag 1.455e+03
## physical_force_other_flag 3.024e-01
## physical_force_restraint_used_flag 2.106e-01
## physical_force_verbal_instruction_flag 1.592e-01
## physical_force_weapon_impact_flag 8.393e+02
## backround_circumstances_violent_crime_flag 1.318e-01
## backround_circumstances_suspect_known_to_carry_weapon_flag 2.737e-01
## suspects_actions_casing_flag 2.438e-01
## suspects_actions_concealed_possession_weapon_flag 1.590e-01
## suspects_actions_decription_flag 1.220e-01
## suspects_actions_drug_transactions_flag 7.400e-01
## suspects_actions_identify_crime_pattern_flag 5.410e-01
## suspects_actions_lookout_flag 3.823e-01
## suspects_actions_other_flag 1.229e-01
## suspects_actions_proximity_to_scene_flag 9.872e-02
## search_basis_admission_flag 2.813e-01
## search_basis_consent_flag 2.416e-01
## search_basis_hard_object_flag 1.965e-01
## search_basis_incidental_to_arrest_flag 1.892e-01
## search_basis_other_flag 2.114e-01
## search_basis_outline_flag 2.286e-01
## suspect_reported_age 4.862e-02
## suspect_sexMALE 1.646e-01
## `suspect_race_descriptionASIAN / PACIFIC ISLANDER` 1.009e+00
## suspect_race_descriptionBLACK 9.715e-01
## `suspect_race_descriptionBLACK HISPANIC` 9.801e-01
## `suspect_race_descriptionMIDDLE EASTERN/SOUTHWEST ASIAN` 1.060e+00
## suspect_race_descriptionWHITE 9.848e-01
## `suspect_race_descriptionWHITE HISPANIC` 9.735e-01
## suspect_height 4.349e-02
## suspect_weight 5.729e-02
## suspect_body_build_typeMED 1.778e-01
## suspect_body_build_typeTHN 1.955e-01
## suspect_body_build_typeU 3.157e-01
## suspect_body_build_typeXXX 1.008e+00
## suspect_eye_colorBLU 4.089e-01
## suspect_eye_colorBRO 1.285e-01
## suspect_eye_colorGRN 5.508e-01
## suspect_eye_colorGRY 1.055e+00
## suspect_eye_colorHAZ 6.230e-01
## suspect_eye_colorMAR 1.455e+03
## suspect_eye_colorMUL 2.301e+00
## suspect_eye_colorOTH 1.935e+00
## suspect_eye_colorPNK 1.455e+03
## suspect_eye_colorZZZ 2.366e-01
## suspect_hair_colorBLK 2.045e-01
## suspect_hair_colorBLN 4.147e-01
## suspect_hair_colorBRO 2.366e-01
## suspect_hair_colorGRN 1.455e+03
## suspect_hair_colorGRY 3.475e-01
## suspect_hair_colorORG 2.137e+00
## suspect_hair_colorPLE 9.090e+02
## suspect_hair_colorPNK 1.600e+00
## suspect_hair_colorRED 5.146e-01
## suspect_hair_colorSDY 1.154e+00
## suspect_hair_colorWHI 7.884e-01
## suspect_hair_colorXXX 3.241e-01
## suspect_hair_colorZZZ 5.954e-01
## stop_location_boro_nameBROOKLYN 1.350e-01
## stop_location_boro_nameMANHATTAN 1.462e-01
## stop_location_boro_nameQUEENS 1.589e-01
## `stop_location_boro_nameSTATEN ISLAND` 2.258e-01
## time_of_dayAfternoon 1.585e-01
## time_of_dayEvening 1.548e-01
## time_of_dayNight 1.622e-01
## pop_ba_above_est 3.903e-05
## pop_inpov_est 7.312e-05
## pop_asian_est 6.268e-05
## pop_hisp_est 3.734e-05
## pop_black_est 3.139e-05
## z value
## (Intercept) 0.504
## `(Intercept)` NA
## year2 NA
## month2August 0.791
## month2December 0.071
## month2February -0.094
## month2January -0.395
## month2July 0.020
## month2June -0.843
## month2March -0.060
## month2May -0.005
## month2November -0.151
## month2October 1.351
## month2September 0.301
## day2Monday -1.965
## day2Saturday -0.617
## day2Sunday -1.872
## day2Thursday -2.942
## day2Tuesday -1.843
## day2Wednesday -2.097
## `stop_was_initiatedBased on Radio Run` 0.173
## `stop_was_initiatedBased on Self Initiated` -3.060
## issuing_officer_rankDI -1.118
## issuing_officer_rankDT2 -0.021
## issuing_officer_rankDT3 -0.714
## issuing_officer_rankDTS -1.516
## issuing_officer_rankINS -0.009
## issuing_officer_rankLSA -1.027
## issuing_officer_rankLT -1.179
## issuing_officer_rankPO -1.419
## issuing_officer_rankPOF -1.314
## issuing_officer_rankPOM -1.336
## issuing_officer_rankSDS -0.012
## issuing_officer_rankSGT -1.348
## issuing_officer_rankSSA -1.140
## supervising_officer_rankDI 0.510
## supervising_officer_rankDT3 -0.019
## supervising_officer_rankDTS NA
## supervising_officer_rankINS 0.405
## supervising_officer_rankLCD -0.012
## supervising_officer_rankLSA -0.748
## supervising_officer_rankLT 0.599
## supervising_officer_rankPO 1.551
## supervising_officer_rankPOF -0.010
## supervising_officer_rankPOM -0.060
## supervising_officer_rankSDS -0.409
## supervising_officer_rankSGT 0.383
## supervising_officer_rankSSA 0.424
## `suspected_crime_descriptionAUTO STRIPPIG` 1.099
## suspected_crime_descriptionBURGLARY -0.192
## suspected_crime_descriptionCPSP 1.241
## suspected_crime_descriptionCPW -6.818
## `suspected_crime_descriptionCRIMINAL MISCHIEF` 1.840
## `suspected_crime_descriptionCRIMINAL POSSESSION OF CONTROLLED SUBSTANCE` -0.849
## `suspected_crime_descriptionCRIMINAL POSSESSION OF FORGED INSTRUMENT` -0.016
## `suspected_crime_descriptionCRIMINAL SALE OF CONTROLLED SUBSTANCE` -1.159
## `suspected_crime_descriptionCRIMINAL SALE OF MARIHUANA` -0.657
## `suspected_crime_descriptionCRIMINAL TRESPASS` 0.294
## `suspected_crime_descriptionFORCIBLE TOUCHING` 2.581
## `suspected_crime_descriptionGRAND LARCENY` 2.224
## `suspected_crime_descriptionGRAND LARCENY AUTO` 1.626
## `suspected_crime_descriptionMAKING GRAFFITI` 2.023
## suspected_crime_descriptionMENACING -0.128
## suspected_crime_descriptionMURDER -0.910
## suspected_crime_descriptionOTHER -1.521
## `suspected_crime_descriptionPETIT LARCENY` 3.331
## suspected_crime_descriptionRAPE -1.831
## `suspected_crime_descriptionRECKLESS ENDANGERMENT` -0.072
## suspected_crime_descriptionROBBERY 1.482
## suspected_crime_descriptionTERRORISM -0.016
## `suspected_crime_descriptionTHEFT OF SERVICES` -0.155
## `suspected_crime_descriptionUNAUTHORIZED USE OF A VEHICLE` 0.800
## officer_explained_stop_flag -4.545
## other_person_stopped_flag -1.257
## officer_in_uniform_flag 1.357
## frisked_flag -0.950
## searched_flag 4.909
## ask_for_consent_flg 0.098
## consent_given_flg -3.052
## other_contraband_flag 10.650
## firearm_flag 6.938
## knife_cutter_flag -2.069
## other_weapon_flag -1.332
## weapon_found_flag 4.010
## physical_force_cew_flag 0.828
## physical_force_draw_point_firearm_flag 0.435
## physical_force_handcuff_suspect_flag 3.451
## physical_force_oc_spray_used_flag 0.011
## physical_force_other_flag -0.901
## physical_force_restraint_used_flag 1.109
## physical_force_verbal_instruction_flag 0.827
## physical_force_weapon_impact_flag 0.016
## backround_circumstances_violent_crime_flag 0.499
## backround_circumstances_suspect_known_to_carry_weapon_flag 1.495
## suspects_actions_casing_flag -0.672
## suspects_actions_concealed_possession_weapon_flag -2.152
## suspects_actions_decription_flag 3.459
## suspects_actions_drug_transactions_flag 0.365
## suspects_actions_identify_crime_pattern_flag -1.364
## suspects_actions_lookout_flag 0.745
## suspects_actions_other_flag -0.038
## suspects_actions_proximity_to_scene_flag 0.813
## search_basis_admission_flag -0.404
## search_basis_consent_flag 0.216
## search_basis_hard_object_flag 0.105
## search_basis_incidental_to_arrest_flag 16.314
## search_basis_other_flag 1.895
## search_basis_outline_flag 1.726
## suspect_reported_age 0.645
## suspect_sexMALE 0.260
## `suspect_race_descriptionASIAN / PACIFIC ISLANDER` -1.248
## suspect_race_descriptionBLACK -1.076
## `suspect_race_descriptionBLACK HISPANIC` -1.201
## `suspect_race_descriptionMIDDLE EASTERN/SOUTHWEST ASIAN` -1.221
## suspect_race_descriptionWHITE -1.545
## `suspect_race_descriptionWHITE HISPANIC` -1.093
## suspect_height 0.089
## suspect_weight -1.837
## suspect_body_build_typeMED -1.667
## suspect_body_build_typeTHN -1.376
## suspect_body_build_typeU -1.184
## suspect_body_build_typeXXX -1.172
## suspect_eye_colorBLU -0.715
## suspect_eye_colorBRO 0.153
## suspect_eye_colorGRN 0.371
## suspect_eye_colorGRY 0.191
## suspect_eye_colorHAZ -0.704
## suspect_eye_colorMAR -0.008
## suspect_eye_colorMUL 0.708
## suspect_eye_colorOTH 1.055
## suspect_eye_colorPNK -0.007
## suspect_eye_colorZZZ -0.364
## suspect_hair_colorBLK -1.884
## suspect_hair_colorBLN -0.598
## suspect_hair_colorBRO -0.872
## suspect_hair_colorGRN -0.010
## suspect_hair_colorGRY -0.891
## suspect_hair_colorORG -1.181
## suspect_hair_colorPLE -0.015
## suspect_hair_colorPNK -0.588
## suspect_hair_colorRED 1.232
## suspect_hair_colorSDY 0.335
## suspect_hair_colorWHI -0.457
## suspect_hair_colorXXX -5.931
## suspect_hair_colorZZZ -0.343
## stop_location_boro_nameBROOKLYN 3.422
## stop_location_boro_nameMANHATTAN 2.173
## stop_location_boro_nameQUEENS 1.142
## `stop_location_boro_nameSTATEN ISLAND` 0.860
## time_of_dayAfternoon 0.611
## time_of_dayEvening -1.049
## time_of_dayNight 0.155
## pop_ba_above_est 1.817
## pop_inpov_est 1.472
## pop_asian_est 0.525
## pop_hisp_est 0.928
## pop_black_est -1.835
## Pr(>|z|)
## (Intercept) 0.614329
## `(Intercept)` NA
## year2 NA
## month2August 0.429056
## month2December 0.943308
## month2February 0.925442
## month2January 0.692753
## month2July 0.984303
## month2June 0.399284
## month2March 0.951866
## month2May 0.995719
## month2November 0.879734
## month2October 0.176747
## month2September 0.763113
## day2Monday 0.049371
## day2Saturday 0.537177
## day2Sunday 0.061187
## day2Thursday 0.003260
## day2Tuesday 0.065336
## day2Wednesday 0.035991
## `stop_was_initiatedBased on Radio Run` 0.862988
## `stop_was_initiatedBased on Self Initiated` 0.002212
## issuing_officer_rankDI 0.263630
## issuing_officer_rankDT2 0.982940
## issuing_officer_rankDT3 0.475435
## issuing_officer_rankDTS 0.129406
## issuing_officer_rankINS 0.992829
## issuing_officer_rankLSA 0.304633
## issuing_officer_rankLT 0.238296
## issuing_officer_rankPO 0.155854
## issuing_officer_rankPOF 0.188875
## issuing_officer_rankPOM 0.181451
## issuing_officer_rankSDS 0.990812
## issuing_officer_rankSGT 0.177668
## issuing_officer_rankSSA 0.254260
## supervising_officer_rankDI 0.610119
## supervising_officer_rankDT3 0.985140
## supervising_officer_rankDTS NA
## supervising_officer_rankINS 0.685588
## supervising_officer_rankLCD 0.990098
## supervising_officer_rankLSA 0.454385
## supervising_officer_rankLT 0.549138
## supervising_officer_rankPO 0.120821
## supervising_officer_rankPOF 0.992184
## supervising_officer_rankPOM 0.952485
## supervising_officer_rankSDS 0.682390
## supervising_officer_rankSGT 0.701913
## supervising_officer_rankSSA 0.671926
## `suspected_crime_descriptionAUTO STRIPPIG` 0.271676
## suspected_crime_descriptionBURGLARY 0.847603
## suspected_crime_descriptionCPSP 0.214616
## suspected_crime_descriptionCPW 9.26e-12
## `suspected_crime_descriptionCRIMINAL MISCHIEF` 0.065700
## `suspected_crime_descriptionCRIMINAL POSSESSION OF CONTROLLED SUBSTANCE` 0.395911
## `suspected_crime_descriptionCRIMINAL POSSESSION OF FORGED INSTRUMENT` 0.987352
## `suspected_crime_descriptionCRIMINAL SALE OF CONTROLLED SUBSTANCE` 0.246533
## `suspected_crime_descriptionCRIMINAL SALE OF MARIHUANA` 0.511454
## `suspected_crime_descriptionCRIMINAL TRESPASS` 0.768809
## `suspected_crime_descriptionFORCIBLE TOUCHING` 0.009840
## `suspected_crime_descriptionGRAND LARCENY` 0.026122
## `suspected_crime_descriptionGRAND LARCENY AUTO` 0.103892
## `suspected_crime_descriptionMAKING GRAFFITI` 0.043105
## suspected_crime_descriptionMENACING 0.897981
## suspected_crime_descriptionMURDER 0.362664
## suspected_crime_descriptionOTHER 0.128176
## `suspected_crime_descriptionPETIT LARCENY` 0.000864
## suspected_crime_descriptionRAPE 0.067044
## `suspected_crime_descriptionRECKLESS ENDANGERMENT` 0.942431
## suspected_crime_descriptionROBBERY 0.138255
## suspected_crime_descriptionTERRORISM 0.987553
## `suspected_crime_descriptionTHEFT OF SERVICES` 0.876572
## `suspected_crime_descriptionUNAUTHORIZED USE OF A VEHICLE` 0.423965
## officer_explained_stop_flag 5.50e-06
## other_person_stopped_flag 0.208683
## officer_in_uniform_flag 0.174767
## frisked_flag 0.341883
## searched_flag 9.15e-07
## ask_for_consent_flg 0.922141
## consent_given_flg 0.002271
## other_contraband_flag < 2e-16
## firearm_flag 3.98e-12
## knife_cutter_flag 0.038582
## other_weapon_flag 0.182737
## weapon_found_flag 6.07e-05
## physical_force_cew_flag 0.407808
## physical_force_draw_point_firearm_flag 0.663784
## physical_force_handcuff_suspect_flag 0.000558
## physical_force_oc_spray_used_flag 0.991300
## physical_force_other_flag 0.367811
## physical_force_restraint_used_flag 0.267430
## physical_force_verbal_instruction_flag 0.408076
## physical_force_weapon_impact_flag 0.987121
## backround_circumstances_violent_crime_flag 0.617989
## backround_circumstances_suspect_known_to_carry_weapon_flag 0.134993
## suspects_actions_casing_flag 0.501436
## suspects_actions_concealed_possession_weapon_flag 0.031396
## suspects_actions_decription_flag 0.000542
## suspects_actions_drug_transactions_flag 0.714916
## suspects_actions_identify_crime_pattern_flag 0.172585
## suspects_actions_lookout_flag 0.456343
## suspects_actions_other_flag 0.969715
## suspects_actions_proximity_to_scene_flag 0.415972
## search_basis_admission_flag 0.686470
## search_basis_consent_flag 0.829331
## search_basis_hard_object_flag 0.916532
## search_basis_incidental_to_arrest_flag < 2e-16
## search_basis_other_flag 0.058046
## search_basis_outline_flag 0.084325
## suspect_reported_age 0.518999
## suspect_sexMALE 0.795072
## `suspect_race_descriptionASIAN / PACIFIC ISLANDER` 0.212050
## suspect_race_descriptionBLACK 0.281867
## `suspect_race_descriptionBLACK HISPANIC` 0.229617
## `suspect_race_descriptionMIDDLE EASTERN/SOUTHWEST ASIAN` 0.222129
## suspect_race_descriptionWHITE 0.122337
## `suspect_race_descriptionWHITE HISPANIC` 0.274552
## suspect_height 0.929153
## suspect_weight 0.066278
## suspect_body_build_typeMED 0.095448
## suspect_body_build_typeTHN 0.168879
## suspect_body_build_typeU 0.236481
## suspect_body_build_typeXXX 0.241386
## suspect_eye_colorBLU 0.474702
## suspect_eye_colorBRO 0.878016
## suspect_eye_colorGRN 0.710607
## suspect_eye_colorGRY 0.848160
## suspect_eye_colorHAZ 0.481345
## suspect_eye_colorMAR 0.993731
## suspect_eye_colorMUL 0.479059
## suspect_eye_colorOTH 0.291240
## suspect_eye_colorPNK 0.994588
## suspect_eye_colorZZZ 0.716007
## suspect_hair_colorBLK 0.059589
## suspect_hair_colorBLN 0.549900
## suspect_hair_colorBRO 0.383022
## suspect_hair_colorGRN 0.992200
## suspect_hair_colorGRY 0.372847
## suspect_hair_colorORG 0.237772
## suspect_hair_colorPLE 0.987933
## suspect_hair_colorPNK 0.556390
## suspect_hair_colorRED 0.217863
## suspect_hair_colorSDY 0.737708
## suspect_hair_colorWHI 0.647634
## suspect_hair_colorXXX 3.00e-09
## suspect_hair_colorZZZ 0.731370
## stop_location_boro_nameBROOKLYN 0.000621
## stop_location_boro_nameMANHATTAN 0.029803
## stop_location_boro_nameQUEENS 0.253560
## `stop_location_boro_nameSTATEN ISLAND` 0.389698
## time_of_dayAfternoon 0.541137
## time_of_dayEvening 0.294189
## time_of_dayNight 0.876752
## pop_ba_above_est 0.069220
## pop_inpov_est 0.140975
## pop_asian_est 0.599407
## pop_hisp_est 0.353579
## pop_black_est 0.066460
##
## (Intercept)
## `(Intercept)`
## year2
## month2August
## month2December
## month2February
## month2January
## month2July
## month2June
## month2March
## month2May
## month2November
## month2October
## month2September
## day2Monday *
## day2Saturday
## day2Sunday .
## day2Thursday **
## day2Tuesday .
## day2Wednesday *
## `stop_was_initiatedBased on Radio Run`
## `stop_was_initiatedBased on Self Initiated` **
## issuing_officer_rankDI
## issuing_officer_rankDT2
## issuing_officer_rankDT3
## issuing_officer_rankDTS
## issuing_officer_rankINS
## issuing_officer_rankLSA
## issuing_officer_rankLT
## issuing_officer_rankPO
## issuing_officer_rankPOF
## issuing_officer_rankPOM
## issuing_officer_rankSDS
## issuing_officer_rankSGT
## issuing_officer_rankSSA
## supervising_officer_rankDI
## supervising_officer_rankDT3
## supervising_officer_rankDTS
## supervising_officer_rankINS
## supervising_officer_rankLCD
## supervising_officer_rankLSA
## supervising_officer_rankLT
## supervising_officer_rankPO
## supervising_officer_rankPOF
## supervising_officer_rankPOM
## supervising_officer_rankSDS
## supervising_officer_rankSGT
## supervising_officer_rankSSA
## `suspected_crime_descriptionAUTO STRIPPIG`
## suspected_crime_descriptionBURGLARY
## suspected_crime_descriptionCPSP
## suspected_crime_descriptionCPW ***
## `suspected_crime_descriptionCRIMINAL MISCHIEF` .
## `suspected_crime_descriptionCRIMINAL POSSESSION OF CONTROLLED SUBSTANCE`
## `suspected_crime_descriptionCRIMINAL POSSESSION OF FORGED INSTRUMENT`
## `suspected_crime_descriptionCRIMINAL SALE OF CONTROLLED SUBSTANCE`
## `suspected_crime_descriptionCRIMINAL SALE OF MARIHUANA`
## `suspected_crime_descriptionCRIMINAL TRESPASS`
## `suspected_crime_descriptionFORCIBLE TOUCHING` **
## `suspected_crime_descriptionGRAND LARCENY` *
## `suspected_crime_descriptionGRAND LARCENY AUTO`
## `suspected_crime_descriptionMAKING GRAFFITI` *
## suspected_crime_descriptionMENACING
## suspected_crime_descriptionMURDER
## suspected_crime_descriptionOTHER
## `suspected_crime_descriptionPETIT LARCENY` ***
## suspected_crime_descriptionRAPE .
## `suspected_crime_descriptionRECKLESS ENDANGERMENT`
## suspected_crime_descriptionROBBERY
## suspected_crime_descriptionTERRORISM
## `suspected_crime_descriptionTHEFT OF SERVICES`
## `suspected_crime_descriptionUNAUTHORIZED USE OF A VEHICLE`
## officer_explained_stop_flag ***
## other_person_stopped_flag
## officer_in_uniform_flag
## frisked_flag
## searched_flag ***
## ask_for_consent_flg
## consent_given_flg **
## other_contraband_flag ***
## firearm_flag ***
## knife_cutter_flag *
## other_weapon_flag
## weapon_found_flag ***
## physical_force_cew_flag
## physical_force_draw_point_firearm_flag
## physical_force_handcuff_suspect_flag ***
## physical_force_oc_spray_used_flag
## physical_force_other_flag
## physical_force_restraint_used_flag
## physical_force_verbal_instruction_flag
## physical_force_weapon_impact_flag
## backround_circumstances_violent_crime_flag
## backround_circumstances_suspect_known_to_carry_weapon_flag
## suspects_actions_casing_flag
## suspects_actions_concealed_possession_weapon_flag *
## suspects_actions_decription_flag ***
## suspects_actions_drug_transactions_flag
## suspects_actions_identify_crime_pattern_flag
## suspects_actions_lookout_flag
## suspects_actions_other_flag
## suspects_actions_proximity_to_scene_flag
## search_basis_admission_flag
## search_basis_consent_flag
## search_basis_hard_object_flag
## search_basis_incidental_to_arrest_flag ***
## search_basis_other_flag .
## search_basis_outline_flag .
## suspect_reported_age
## suspect_sexMALE
## `suspect_race_descriptionASIAN / PACIFIC ISLANDER`
## suspect_race_descriptionBLACK
## `suspect_race_descriptionBLACK HISPANIC`
## `suspect_race_descriptionMIDDLE EASTERN/SOUTHWEST ASIAN`
## suspect_race_descriptionWHITE
## `suspect_race_descriptionWHITE HISPANIC`
## suspect_height
## suspect_weight .
## suspect_body_build_typeMED .
## suspect_body_build_typeTHN
## suspect_body_build_typeU
## suspect_body_build_typeXXX
## suspect_eye_colorBLU
## suspect_eye_colorBRO
## suspect_eye_colorGRN
## suspect_eye_colorGRY
## suspect_eye_colorHAZ
## suspect_eye_colorMAR
## suspect_eye_colorMUL
## suspect_eye_colorOTH
## suspect_eye_colorPNK
## suspect_eye_colorZZZ
## suspect_hair_colorBLK .
## suspect_hair_colorBLN
## suspect_hair_colorBRO
## suspect_hair_colorGRN
## suspect_hair_colorGRY
## suspect_hair_colorORG
## suspect_hair_colorPLE
## suspect_hair_colorPNK
## suspect_hair_colorRED
## suspect_hair_colorSDY
## suspect_hair_colorWHI
## suspect_hair_colorXXX ***
## suspect_hair_colorZZZ
## stop_location_boro_nameBROOKLYN ***
## stop_location_boro_nameMANHATTAN *
## stop_location_boro_nameQUEENS
## `stop_location_boro_nameSTATEN ISLAND`
## time_of_dayAfternoon
## time_of_dayEvening
## time_of_dayNight
## pop_ba_above_est .
## pop_inpov_est
## pop_asian_est
## pop_hisp_est
## pop_black_est .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10436.2 on 8425 degrees of freedom
## Residual deviance: 4218.3 on 8272 degrees of freedom
## AIC: 4526.3
##
## Number of Fisher Scoring iterations: 14
logit_sub <- glm(y_train ~ ., data = X_train_subset_df, family = binomial(logit))
summary(logit_sub)
##
## Call:
## glm(formula = y_train ~ ., family = binomial(logit), data = X_train_subset_df)
##
## Coefficients: (3 not defined because of singularities)
## Estimate
## (Intercept) 2.695e+00
## `(Intercept)` NA
## year2 NA
## month2August 2.708e-01
## month2December 7.675e-02
## month2February -4.976e-02
## month2January 9.834e-02
## month2July 7.722e-02
## month2June -3.918e-02
## month2March -7.988e-02
## month2May -3.146e-02
## month2November 6.207e-02
## month2October 2.395e-01
## month2September 1.126e-01
## day2Monday -1.663e-01
## day2Saturday -4.335e-02
## day2Sunday -2.478e-01
## day2Thursday -1.692e-01
## day2Tuesday -9.276e-02
## day2Wednesday -1.510e-01
## `stop_was_initiatedBased on Radio Run` -1.516e-01
## `stop_was_initiatedBased on Self Initiated` -2.832e-02
## issuing_officer_rankDI -1.436e+00
## issuing_officer_rankDT2 -1.510e+01
## issuing_officer_rankDT3 -1.384e+00
## issuing_officer_rankDTS -1.695e+00
## issuing_officer_rankINS -1.360e+01
## issuing_officer_rankLSA -2.212e+00
## issuing_officer_rankLT -1.759e+00
## issuing_officer_rankPO -1.494e+00
## issuing_officer_rankPOF -1.656e+00
## issuing_officer_rankPOM -1.443e+00
## issuing_officer_rankSDS -1.036e+01
## issuing_officer_rankSGT -1.838e+00
## issuing_officer_rankSSA -2.003e+00
## supervising_officer_rankDI 5.873e-01
## supervising_officer_rankDT3 -1.280e+01
## supervising_officer_rankDTS NA
## supervising_officer_rankINS -5.569e-01
## supervising_officer_rankLCD -1.181e+01
## supervising_officer_rankLSA -7.559e-01
## supervising_officer_rankLT 3.481e-01
## supervising_officer_rankPO 5.823e-01
## supervising_officer_rankPOF -1.322e+01
## supervising_officer_rankPOM 7.778e-01
## supervising_officer_rankSDS 1.891e-02
## supervising_officer_rankSGT 1.780e-01
## supervising_officer_rankSSA 3.453e-01
## `suspected_crime_descriptionAUTO STRIPPIG` 4.082e-01
## suspected_crime_descriptionBURGLARY 1.125e-01
## suspected_crime_descriptionCPSP 4.838e-01
## suspected_crime_descriptionCPW -8.343e-01
## `suspected_crime_descriptionCRIMINAL MISCHIEF` 5.214e-01
## `suspected_crime_descriptionCRIMINAL POSSESSION OF CONTROLLED SUBSTANCE` 4.631e-01
## `suspected_crime_descriptionCRIMINAL POSSESSION OF FORGED INSTRUMENT` -1.356e+01
## `suspected_crime_descriptionCRIMINAL SALE OF CONTROLLED SUBSTANCE` -8.840e-01
## `suspected_crime_descriptionCRIMINAL SALE OF MARIHUANA` -5.538e-01
## `suspected_crime_descriptionCRIMINAL TRESPASS` -9.974e-02
## `suspected_crime_descriptionFORCIBLE TOUCHING` 2.026e+00
## `suspected_crime_descriptionGRAND LARCENY` 3.899e-01
## `suspected_crime_descriptionGRAND LARCENY AUTO` 2.671e-01
## `suspected_crime_descriptionMAKING GRAFFITI` 1.289e+00
## suspected_crime_descriptionMENACING 1.262e-01
## suspected_crime_descriptionMURDER -8.698e-01
## suspected_crime_descriptionOTHER -2.690e-01
## `suspected_crime_descriptionPETIT LARCENY` 1.019e+00
## suspected_crime_descriptionRAPE -1.000e+00
## `suspected_crime_descriptionRECKLESS ENDANGERMENT` -4.255e-01
## suspected_crime_descriptionROBBERY 3.692e-01
## suspected_crime_descriptionTERRORISM -1.312e+01
## `suspected_crime_descriptionTHEFT OF SERVICES` -1.332e+00
## `suspected_crime_descriptionUNAUTHORIZED USE OF A VEHICLE` 5.994e-01
## officer_explained_stop_flag -1.733e+00
## other_person_stopped_flag -2.295e-01
## officer_in_uniform_flag 4.070e-01
## ask_for_consent_flg -2.484e-01
## consent_given_flg -3.630e-01
## backround_circumstances_violent_crime_flag 3.352e-01
## backround_circumstances_suspect_known_to_carry_weapon_flag 1.872e-01
## suspects_actions_casing_flag -1.738e-01
## suspects_actions_concealed_possession_weapon_flag 3.961e-01
## suspects_actions_decription_flag 6.588e-01
## suspects_actions_drug_transactions_flag 9.950e-01
## suspects_actions_identify_crime_pattern_flag -2.014e-01
## suspects_actions_lookout_flag -2.838e-01
## suspects_actions_other_flag -2.753e-02
## suspects_actions_proximity_to_scene_flag 1.812e-01
## suspect_reported_age 5.810e-02
## suspect_sexMALE -1.557e-01
## `suspect_race_descriptionASIAN / PACIFIC ISLANDER` -1.431e+00
## suspect_race_descriptionBLACK -1.128e+00
## `suspect_race_descriptionBLACK HISPANIC` -1.180e+00
## `suspect_race_descriptionMIDDLE EASTERN/SOUTHWEST ASIAN` -1.519e+00
## suspect_race_descriptionWHITE -1.491e+00
## `suspect_race_descriptionWHITE HISPANIC` -1.122e+00
## suspect_height 2.114e-02
## suspect_weight -4.767e-02
## suspect_body_build_typeMED -8.344e-02
## suspect_body_build_typeTHN -6.150e-02
## suspect_body_build_typeU 1.783e-01
## suspect_body_build_typeXXX -6.865e-01
## suspect_hair_colorBLK -1.630e-01
## suspect_hair_colorBLN 3.811e-01
## suspect_hair_colorBRO -2.195e-02
## suspect_hair_colorGRN -1.304e+01
## suspect_hair_colorGRY 4.727e-02
## suspect_hair_colorORG -9.198e-01
## suspect_hair_colorPLE -1.455e+01
## suspect_hair_colorPNK -7.190e-01
## suspect_hair_colorRED 2.816e-01
## suspect_hair_colorSDY -2.746e-02
## suspect_hair_colorWHI -2.421e-01
## suspect_hair_colorXXX -2.415e+00
## suspect_hair_colorZZZ -5.053e-02
## stop_location_boro_nameBROOKLYN 6.163e-01
## stop_location_boro_nameMANHATTAN 3.447e-01
## stop_location_boro_nameQUEENS 6.456e-01
## `stop_location_boro_nameSTATEN ISLAND` 5.853e-01
## time_of_dayAfternoon 9.326e-02
## time_of_dayEvening -8.312e-02
## time_of_dayNight 1.002e-01
## pop_ba_above_est 1.959e-05
## pop_inpov_est -1.014e-06
## pop_asian_est 1.671e-05
## pop_hisp_est 4.007e-06
## pop_black_est 2.420e-05
## Std. Error
## (Intercept) 1.267e+00
## `(Intercept)` NA
## year2 NA
## month2August 1.333e-01
## month2December 1.530e-01
## month2February 1.350e-01
## month2January 1.288e-01
## month2July 1.317e-01
## month2June 1.382e-01
## month2March 1.316e-01
## month2May 1.301e-01
## month2November 1.473e-01
## month2October 1.337e-01
## month2September 1.445e-01
## day2Monday 1.076e-01
## day2Saturday 9.836e-02
## day2Sunday 1.031e-01
## day2Thursday 9.793e-02
## day2Tuesday 9.713e-02
## day2Wednesday 9.711e-02
## `stop_was_initiatedBased on Radio Run` 8.753e-02
## `stop_was_initiatedBased on Self Initiated` 1.111e-01
## issuing_officer_rankDI 1.552e+00
## issuing_officer_rankDT2 4.208e+02
## issuing_officer_rankDT3 1.050e+00
## issuing_officer_rankDTS 1.022e+00
## issuing_officer_rankINS 8.827e+02
## issuing_officer_rankLSA 1.580e+00
## issuing_officer_rankLT 1.005e+00
## issuing_officer_rankPO 1.010e+00
## issuing_officer_rankPOF 1.014e+00
## issuing_officer_rankPOM 1.008e+00
## issuing_officer_rankSDS 5.460e+02
## issuing_officer_rankSGT 1.024e+00
## issuing_officer_rankSSA 1.165e+00
## supervising_officer_rankDI 9.931e-01
## supervising_officer_rankDT3 5.599e+02
## supervising_officer_rankDTS NA
## supervising_officer_rankINS 1.551e+00
## supervising_officer_rankLCD 5.460e+02
## supervising_officer_rankLSA 7.189e-01
## supervising_officer_rankLT 4.723e-01
## supervising_officer_rankPO 1.056e+00
## supervising_officer_rankPOF 8.827e+02
## supervising_officer_rankPOM 7.724e-01
## supervising_officer_rankSDS 8.409e-01
## supervising_officer_rankSGT 4.730e-01
## supervising_officer_rankSSA 4.999e-01
## `suspected_crime_descriptionAUTO STRIPPIG` 4.453e-01
## suspected_crime_descriptionBURGLARY 1.392e-01
## suspected_crime_descriptionCPSP 3.127e-01
## suspected_crime_descriptionCPW 1.136e-01
## `suspected_crime_descriptionCRIMINAL MISCHIEF` 2.205e-01
## `suspected_crime_descriptionCRIMINAL POSSESSION OF CONTROLLED SUBSTANCE` 5.726e-01
## `suspected_crime_descriptionCRIMINAL POSSESSION OF FORGED INSTRUMENT` 4.978e+02
## `suspected_crime_descriptionCRIMINAL SALE OF CONTROLLED SUBSTANCE` 5.900e-01
## `suspected_crime_descriptionCRIMINAL SALE OF MARIHUANA` 1.303e+00
## `suspected_crime_descriptionCRIMINAL TRESPASS` 2.548e-01
## `suspected_crime_descriptionFORCIBLE TOUCHING` 8.459e-01
## `suspected_crime_descriptionGRAND LARCENY` 1.609e-01
## `suspected_crime_descriptionGRAND LARCENY AUTO` 1.621e-01
## `suspected_crime_descriptionMAKING GRAFFITI` 4.301e-01
## suspected_crime_descriptionMENACING 1.857e-01
## suspected_crime_descriptionMURDER 5.098e-01
## suspected_crime_descriptionOTHER 1.714e-01
## `suspected_crime_descriptionPETIT LARCENY` 1.330e-01
## suspected_crime_descriptionRAPE 1.121e+00
## `suspected_crime_descriptionRECKLESS ENDANGERMENT` 2.817e-01
## suspected_crime_descriptionROBBERY 1.165e-01
## suspected_crime_descriptionTERRORISM 5.051e+02
## `suspected_crime_descriptionTHEFT OF SERVICES` 1.068e+00
## `suspected_crime_descriptionUNAUTHORIZED USE OF A VEHICLE` 3.795e-01
## officer_explained_stop_flag 1.028e-01
## other_person_stopped_flag 6.371e-02
## officer_in_uniform_flag 2.392e-01
## ask_for_consent_flg 1.247e-01
## consent_given_flg 1.161e-01
## backround_circumstances_violent_crime_flag 7.035e-02
## backround_circumstances_suspect_known_to_carry_weapon_flag 1.615e-01
## suspects_actions_casing_flag 1.681e-01
## suspects_actions_concealed_possession_weapon_flag 9.794e-02
## suspects_actions_decription_flag 8.109e-02
## suspects_actions_drug_transactions_flag 4.930e-01
## suspects_actions_identify_crime_pattern_flag 3.597e-01
## suspects_actions_lookout_flag 2.734e-01
## suspects_actions_other_flag 8.052e-02
## suspects_actions_proximity_to_scene_flag 6.724e-02
## suspect_reported_age 3.286e-02
## suspect_sexMALE 1.117e-01
## `suspect_race_descriptionASIAN / PACIFIC ISLANDER` 7.654e-01
## suspect_race_descriptionBLACK 7.418e-01
## `suspect_race_descriptionBLACK HISPANIC` 7.461e-01
## `suspect_race_descriptionMIDDLE EASTERN/SOUTHWEST ASIAN` 7.958e-01
## suspect_race_descriptionWHITE 7.496e-01
## `suspect_race_descriptionWHITE HISPANIC` 7.433e-01
## suspect_height 2.814e-02
## suspect_weight 3.822e-02
## suspect_body_build_typeMED 1.187e-01
## suspect_body_build_typeTHN 1.312e-01
## suspect_body_build_typeU 2.011e-01
## suspect_body_build_typeXXX 6.022e-01
## suspect_hair_colorBLK 1.381e-01
## suspect_hair_colorBLN 2.710e-01
## suspect_hair_colorBRO 1.592e-01
## suspect_hair_colorGRN 8.827e+02
## suspect_hair_colorGRY 2.381e-01
## suspect_hair_colorORG 9.594e-01
## suspect_hair_colorPLE 5.168e+02
## suspect_hair_colorPNK 9.668e-01
## suspect_hair_colorRED 4.078e-01
## suspect_hair_colorSDY 8.006e-01
## suspect_hair_colorWHI 5.646e-01
## suspect_hair_colorXXX 2.333e-01
## suspect_hair_colorZZZ 4.073e-01
## stop_location_boro_nameBROOKLYN 8.740e-02
## stop_location_boro_nameMANHATTAN 9.700e-02
## stop_location_boro_nameQUEENS 1.019e-01
## `stop_location_boro_nameSTATEN ISLAND` 1.514e-01
## time_of_dayAfternoon 1.084e-01
## time_of_dayEvening 1.047e-01
## time_of_dayNight 1.101e-01
## pop_ba_above_est 2.780e-05
## pop_inpov_est 4.915e-05
## pop_asian_est 4.369e-05
## pop_hisp_est 2.565e-05
## pop_black_est 2.044e-05
## z value
## (Intercept) 2.128
## `(Intercept)` NA
## year2 NA
## month2August 2.031
## month2December 0.502
## month2February -0.369
## month2January 0.764
## month2July 0.586
## month2June -0.283
## month2March -0.607
## month2May -0.242
## month2November 0.421
## month2October 1.792
## month2September 0.779
## day2Monday -1.546
## day2Saturday -0.441
## day2Sunday -2.403
## day2Thursday -1.728
## day2Tuesday -0.955
## day2Wednesday -1.555
## `stop_was_initiatedBased on Radio Run` -1.732
## `stop_was_initiatedBased on Self Initiated` -0.255
## issuing_officer_rankDI -0.925
## issuing_officer_rankDT2 -0.036
## issuing_officer_rankDT3 -1.319
## issuing_officer_rankDTS -1.657
## issuing_officer_rankINS -0.015
## issuing_officer_rankLSA -1.400
## issuing_officer_rankLT -1.750
## issuing_officer_rankPO -1.479
## issuing_officer_rankPOF -1.633
## issuing_officer_rankPOM -1.431
## issuing_officer_rankSDS -0.019
## issuing_officer_rankSGT -1.795
## issuing_officer_rankSSA -1.719
## supervising_officer_rankDI 0.591
## supervising_officer_rankDT3 -0.023
## supervising_officer_rankDTS NA
## supervising_officer_rankINS -0.359
## supervising_officer_rankLCD -0.022
## supervising_officer_rankLSA -1.051
## supervising_officer_rankLT 0.737
## supervising_officer_rankPO 0.551
## supervising_officer_rankPOF -0.015
## supervising_officer_rankPOM 1.007
## supervising_officer_rankSDS 0.022
## supervising_officer_rankSGT 0.376
## supervising_officer_rankSSA 0.691
## `suspected_crime_descriptionAUTO STRIPPIG` 0.917
## suspected_crime_descriptionBURGLARY 0.809
## suspected_crime_descriptionCPSP 1.547
## suspected_crime_descriptionCPW -7.346
## `suspected_crime_descriptionCRIMINAL MISCHIEF` 2.365
## `suspected_crime_descriptionCRIMINAL POSSESSION OF CONTROLLED SUBSTANCE` 0.809
## `suspected_crime_descriptionCRIMINAL POSSESSION OF FORGED INSTRUMENT` -0.027
## `suspected_crime_descriptionCRIMINAL SALE OF CONTROLLED SUBSTANCE` -1.498
## `suspected_crime_descriptionCRIMINAL SALE OF MARIHUANA` -0.425
## `suspected_crime_descriptionCRIMINAL TRESPASS` -0.392
## `suspected_crime_descriptionFORCIBLE TOUCHING` 2.395
## `suspected_crime_descriptionGRAND LARCENY` 2.423
## `suspected_crime_descriptionGRAND LARCENY AUTO` 1.647
## `suspected_crime_descriptionMAKING GRAFFITI` 2.998
## suspected_crime_descriptionMENACING 0.680
## suspected_crime_descriptionMURDER -1.706
## suspected_crime_descriptionOTHER -1.570
## `suspected_crime_descriptionPETIT LARCENY` 7.666
## suspected_crime_descriptionRAPE -0.892
## `suspected_crime_descriptionRECKLESS ENDANGERMENT` -1.511
## suspected_crime_descriptionROBBERY 3.170
## suspected_crime_descriptionTERRORISM -0.026
## `suspected_crime_descriptionTHEFT OF SERVICES` -1.247
## `suspected_crime_descriptionUNAUTHORIZED USE OF A VEHICLE` 1.580
## officer_explained_stop_flag -16.854
## other_person_stopped_flag -3.603
## officer_in_uniform_flag 1.702
## ask_for_consent_flg -1.991
## consent_given_flg -3.126
## backround_circumstances_violent_crime_flag 4.764
## backround_circumstances_suspect_known_to_carry_weapon_flag 1.159
## suspects_actions_casing_flag -1.034
## suspects_actions_concealed_possession_weapon_flag 4.045
## suspects_actions_decription_flag 8.124
## suspects_actions_drug_transactions_flag 2.018
## suspects_actions_identify_crime_pattern_flag -0.560
## suspects_actions_lookout_flag -1.038
## suspects_actions_other_flag -0.342
## suspects_actions_proximity_to_scene_flag 2.695
## suspect_reported_age 1.768
## suspect_sexMALE -1.394
## `suspect_race_descriptionASIAN / PACIFIC ISLANDER` -1.870
## suspect_race_descriptionBLACK -1.521
## `suspect_race_descriptionBLACK HISPANIC` -1.581
## `suspect_race_descriptionMIDDLE EASTERN/SOUTHWEST ASIAN` -1.909
## suspect_race_descriptionWHITE -1.989
## `suspect_race_descriptionWHITE HISPANIC` -1.510
## suspect_height 0.751
## suspect_weight -1.247
## suspect_body_build_typeMED -0.703
## suspect_body_build_typeTHN -0.469
## suspect_body_build_typeU 0.887
## suspect_body_build_typeXXX -1.140
## suspect_hair_colorBLK -1.181
## suspect_hair_colorBLN 1.406
## suspect_hair_colorBRO -0.138
## suspect_hair_colorGRN -0.015
## suspect_hair_colorGRY 0.198
## suspect_hair_colorORG -0.959
## suspect_hair_colorPLE -0.028
## suspect_hair_colorPNK -0.744
## suspect_hair_colorRED 0.690
## suspect_hair_colorSDY -0.034
## suspect_hair_colorWHI -0.429
## suspect_hair_colorXXX -10.349
## suspect_hair_colorZZZ -0.124
## stop_location_boro_nameBROOKLYN 7.052
## stop_location_boro_nameMANHATTAN 3.554
## stop_location_boro_nameQUEENS 6.335
## `stop_location_boro_nameSTATEN ISLAND` 3.866
## time_of_dayAfternoon 0.860
## time_of_dayEvening -0.794
## time_of_dayNight 0.910
## pop_ba_above_est 0.705
## pop_inpov_est -0.021
## pop_asian_est 0.383
## pop_hisp_est 0.156
## pop_black_est 1.184
## Pr(>|z|)
## (Intercept) 0.033340
## `(Intercept)` NA
## year2 NA
## month2August 0.042231
## month2December 0.615817
## month2February 0.712343
## month2January 0.445137
## month2July 0.557637
## month2June 0.776845
## month2March 0.543814
## month2May 0.808873
## month2November 0.673513
## month2October 0.073172
## month2September 0.435826
## day2Monday 0.122120
## day2Saturday 0.659424
## day2Sunday 0.016276
## day2Thursday 0.083944
## day2Tuesday 0.339606
## day2Wednesday 0.120025
## `stop_was_initiatedBased on Radio Run` 0.083214
## `stop_was_initiatedBased on Self Initiated` 0.798760
## issuing_officer_rankDI 0.354749
## issuing_officer_rankDT2 0.971377
## issuing_officer_rankDT3 0.187289
## issuing_officer_rankDTS 0.097430
## issuing_officer_rankINS 0.987705
## issuing_officer_rankLSA 0.161660
## issuing_officer_rankLT 0.080066
## issuing_officer_rankPO 0.139133
## issuing_officer_rankPOF 0.102521
## issuing_officer_rankPOM 0.152489
## issuing_officer_rankSDS 0.984869
## issuing_officer_rankSGT 0.072713
## issuing_officer_rankSSA 0.085636
## supervising_officer_rankDI 0.554267
## supervising_officer_rankDT3 0.981763
## supervising_officer_rankDTS NA
## supervising_officer_rankINS 0.719519
## supervising_officer_rankLCD 0.982748
## supervising_officer_rankLSA 0.293043
## supervising_officer_rankLT 0.461067
## supervising_officer_rankPO 0.581517
## supervising_officer_rankPOF 0.988055
## supervising_officer_rankPOM 0.313964
## supervising_officer_rankSDS 0.982061
## supervising_officer_rankSGT 0.706718
## supervising_officer_rankSSA 0.489701
## `suspected_crime_descriptionAUTO STRIPPIG` 0.359337
## suspected_crime_descriptionBURGLARY 0.418752
## suspected_crime_descriptionCPSP 0.121759
## suspected_crime_descriptionCPW 2.04e-13
## `suspected_crime_descriptionCRIMINAL MISCHIEF` 0.018054
## `suspected_crime_descriptionCRIMINAL POSSESSION OF CONTROLLED SUBSTANCE` 0.418698
## `suspected_crime_descriptionCRIMINAL POSSESSION OF FORGED INSTRUMENT` 0.978269
## `suspected_crime_descriptionCRIMINAL SALE OF CONTROLLED SUBSTANCE` 0.134019
## `suspected_crime_descriptionCRIMINAL SALE OF MARIHUANA` 0.670787
## `suspected_crime_descriptionCRIMINAL TRESPASS` 0.695425
## `suspected_crime_descriptionFORCIBLE TOUCHING` 0.016600
## `suspected_crime_descriptionGRAND LARCENY` 0.015384
## `suspected_crime_descriptionGRAND LARCENY AUTO` 0.099515
## `suspected_crime_descriptionMAKING GRAFFITI` 0.002720
## suspected_crime_descriptionMENACING 0.496524
## suspected_crime_descriptionMURDER 0.087969
## suspected_crime_descriptionOTHER 0.116507
## `suspected_crime_descriptionPETIT LARCENY` 1.78e-14
## suspected_crime_descriptionRAPE 0.372219
## `suspected_crime_descriptionRECKLESS ENDANGERMENT` 0.130909
## suspected_crime_descriptionROBBERY 0.001526
## suspected_crime_descriptionTERRORISM 0.979282
## `suspected_crime_descriptionTHEFT OF SERVICES` 0.212474
## `suspected_crime_descriptionUNAUTHORIZED USE OF A VEHICLE` 0.114202
## officer_explained_stop_flag < 2e-16
## other_person_stopped_flag 0.000315
## officer_in_uniform_flag 0.088810
## ask_for_consent_flg 0.046436
## consent_given_flg 0.001770
## backround_circumstances_violent_crime_flag 1.89e-06
## backround_circumstances_suspect_known_to_carry_weapon_flag 0.246349
## suspects_actions_casing_flag 0.301060
## suspects_actions_concealed_possession_weapon_flag 5.24e-05
## suspects_actions_decription_flag 4.51e-16
## suspects_actions_drug_transactions_flag 0.043555
## suspects_actions_identify_crime_pattern_flag 0.575618
## suspects_actions_lookout_flag 0.299295
## suspects_actions_other_flag 0.732407
## suspects_actions_proximity_to_scene_flag 0.007031
## suspect_reported_age 0.077058
## suspect_sexMALE 0.163202
## `suspect_race_descriptionASIAN / PACIFIC ISLANDER` 0.061464
## suspect_race_descriptionBLACK 0.128368
## `suspect_race_descriptionBLACK HISPANIC` 0.113890
## `suspect_race_descriptionMIDDLE EASTERN/SOUTHWEST ASIAN` 0.056275
## suspect_race_descriptionWHITE 0.046666
## `suspect_race_descriptionWHITE HISPANIC` 0.131037
## suspect_height 0.452443
## suspect_weight 0.212329
## suspect_body_build_typeMED 0.481914
## suspect_body_build_typeTHN 0.639328
## suspect_body_build_typeU 0.375203
## suspect_body_build_typeXXX 0.254271
## suspect_hair_colorBLK 0.237706
## suspect_hair_colorBLN 0.159680
## suspect_hair_colorBRO 0.890326
## suspect_hair_colorGRN 0.988212
## suspect_hair_colorGRY 0.842659
## suspect_hair_colorORG 0.337691
## suspect_hair_colorPLE 0.977536
## suspect_hair_colorPNK 0.457024
## suspect_hair_colorRED 0.489901
## suspect_hair_colorSDY 0.972638
## suspect_hair_colorWHI 0.668141
## suspect_hair_colorXXX < 2e-16
## suspect_hair_colorZZZ 0.901261
## stop_location_boro_nameBROOKLYN 1.77e-12
## stop_location_boro_nameMANHATTAN 0.000379
## stop_location_boro_nameQUEENS 2.38e-10
## `stop_location_boro_nameSTATEN ISLAND` 0.000111
## time_of_dayAfternoon 0.389559
## time_of_dayEvening 0.427351
## time_of_dayNight 0.362790
## pop_ba_above_est 0.481070
## pop_inpov_est 0.983546
## pop_asian_est 0.702048
## pop_hisp_est 0.875854
## pop_black_est 0.236350
##
## (Intercept) *
## `(Intercept)`
## year2
## month2August *
## month2December
## month2February
## month2January
## month2July
## month2June
## month2March
## month2May
## month2November
## month2October .
## month2September
## day2Monday
## day2Saturday
## day2Sunday *
## day2Thursday .
## day2Tuesday
## day2Wednesday
## `stop_was_initiatedBased on Radio Run` .
## `stop_was_initiatedBased on Self Initiated`
## issuing_officer_rankDI
## issuing_officer_rankDT2
## issuing_officer_rankDT3
## issuing_officer_rankDTS .
## issuing_officer_rankINS
## issuing_officer_rankLSA
## issuing_officer_rankLT .
## issuing_officer_rankPO
## issuing_officer_rankPOF
## issuing_officer_rankPOM
## issuing_officer_rankSDS
## issuing_officer_rankSGT .
## issuing_officer_rankSSA .
## supervising_officer_rankDI
## supervising_officer_rankDT3
## supervising_officer_rankDTS
## supervising_officer_rankINS
## supervising_officer_rankLCD
## supervising_officer_rankLSA
## supervising_officer_rankLT
## supervising_officer_rankPO
## supervising_officer_rankPOF
## supervising_officer_rankPOM
## supervising_officer_rankSDS
## supervising_officer_rankSGT
## supervising_officer_rankSSA
## `suspected_crime_descriptionAUTO STRIPPIG`
## suspected_crime_descriptionBURGLARY
## suspected_crime_descriptionCPSP
## suspected_crime_descriptionCPW ***
## `suspected_crime_descriptionCRIMINAL MISCHIEF` *
## `suspected_crime_descriptionCRIMINAL POSSESSION OF CONTROLLED SUBSTANCE`
## `suspected_crime_descriptionCRIMINAL POSSESSION OF FORGED INSTRUMENT`
## `suspected_crime_descriptionCRIMINAL SALE OF CONTROLLED SUBSTANCE`
## `suspected_crime_descriptionCRIMINAL SALE OF MARIHUANA`
## `suspected_crime_descriptionCRIMINAL TRESPASS`
## `suspected_crime_descriptionFORCIBLE TOUCHING` *
## `suspected_crime_descriptionGRAND LARCENY` *
## `suspected_crime_descriptionGRAND LARCENY AUTO` .
## `suspected_crime_descriptionMAKING GRAFFITI` **
## suspected_crime_descriptionMENACING
## suspected_crime_descriptionMURDER .
## suspected_crime_descriptionOTHER
## `suspected_crime_descriptionPETIT LARCENY` ***
## suspected_crime_descriptionRAPE
## `suspected_crime_descriptionRECKLESS ENDANGERMENT`
## suspected_crime_descriptionROBBERY **
## suspected_crime_descriptionTERRORISM
## `suspected_crime_descriptionTHEFT OF SERVICES`
## `suspected_crime_descriptionUNAUTHORIZED USE OF A VEHICLE`
## officer_explained_stop_flag ***
## other_person_stopped_flag ***
## officer_in_uniform_flag .
## ask_for_consent_flg *
## consent_given_flg **
## backround_circumstances_violent_crime_flag ***
## backround_circumstances_suspect_known_to_carry_weapon_flag
## suspects_actions_casing_flag
## suspects_actions_concealed_possession_weapon_flag ***
## suspects_actions_decription_flag ***
## suspects_actions_drug_transactions_flag *
## suspects_actions_identify_crime_pattern_flag
## suspects_actions_lookout_flag
## suspects_actions_other_flag
## suspects_actions_proximity_to_scene_flag **
## suspect_reported_age .
## suspect_sexMALE
## `suspect_race_descriptionASIAN / PACIFIC ISLANDER` .
## suspect_race_descriptionBLACK
## `suspect_race_descriptionBLACK HISPANIC`
## `suspect_race_descriptionMIDDLE EASTERN/SOUTHWEST ASIAN` .
## suspect_race_descriptionWHITE *
## `suspect_race_descriptionWHITE HISPANIC`
## suspect_height
## suspect_weight
## suspect_body_build_typeMED
## suspect_body_build_typeTHN
## suspect_body_build_typeU
## suspect_body_build_typeXXX
## suspect_hair_colorBLK
## suspect_hair_colorBLN
## suspect_hair_colorBRO
## suspect_hair_colorGRN
## suspect_hair_colorGRY
## suspect_hair_colorORG
## suspect_hair_colorPLE
## suspect_hair_colorPNK
## suspect_hair_colorRED
## suspect_hair_colorSDY
## suspect_hair_colorWHI
## suspect_hair_colorXXX ***
## suspect_hair_colorZZZ
## stop_location_boro_nameBROOKLYN ***
## stop_location_boro_nameMANHATTAN ***
## stop_location_boro_nameQUEENS ***
## `stop_location_boro_nameSTATEN ISLAND` ***
## time_of_dayAfternoon
## time_of_dayEvening
## time_of_dayNight
## pop_ba_above_est
## pop_inpov_est
## pop_asian_est
## pop_hisp_est
## pop_black_est
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10436.2 on 8425 degrees of freedom
## Residual deviance: 8481.8 on 8303 degrees of freedom
## AIC: 8727.8
##
## Number of Fisher Scoring iterations: 13
predict_all <- predict(logit_all, newdata = X_test_df, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
predict_sub <- predict(logit_sub, newdata = X_test_subset_df, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
library(pROC)
# Compute ROC and AUC for the model with all features
roc_all <- roc(y_test, predict_all)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_all <- round(auc(roc_all), 4)
cat("AUC for the model with all features:", auc_all, "\n")
## AUC for the model with all features: 0.947
# Compute ROC and AUC for the model with subset of features
roc_sub <- roc(y_test, predict_sub)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_sub <- round(auc(roc_sub), 4)
cat("AUC for the model with subset features:", auc_sub, "\n")
## AUC for the model with subset features: 0.7698
# Function to compute and plot combined ROC curves
plot_combined_roc <- function(full_roc, subset_roc, full_label, subset_label) {
# Plot the full model ROC curve
plot(y = full_roc$sensitivities, x = 1 - full_roc$specificities, type = 'l',
col = 'blue', lwd = 2, xlab = 'False Positive Rate', ylab = 'True Positive Rate',
main = 'ROC Curves for Logistic Regression Models')
# Add the subset model ROC curve
lines(y = subset_roc$sensitivities, x = 1 - subset_roc$specificities, col = 'green', lwd = 2)
# Add random guess line
abline(a = 0, b = 1, lty = 2, col = 'gray')
# Add grid
grid()
# Add legend
legend('bottomright',
legend = c(sprintf("%s AUC: %.3f", full_label, auc(full_roc)),
sprintf("%s AUC: %.3f", subset_label, auc(subset_roc)),
'Random Guess'),
lty = c(1, 1, 2), lwd = 2, col = c('blue', 'green', 'gray'))
# Add bounding box
box()
}
# Call the function to plot the combined ROC curves
plot_combined_roc(roc_all, roc_sub, "Full Model", "Subset Model")
# Define a function to generate and plot confusion matrix
generate_cm <- function(true, predicted, title) {
# Create a confusion matrix as a data frame
cm <- as.data.frame(table(True = true, Predicted = predicted)) %>%
group_by(Predicted) %>%
mutate(Predicted_pct = Freq / sum(Freq))
# Print the confusion matrix
print(cm)
# Plot the confusion matrix
plot <- ggplot(data = cm, mapping = aes(x = ordered(True, c(1, 0)), y = Predicted, fill = Predicted_pct)) +
geom_tile() +
geom_text(aes(label = round(Predicted_pct, 2)), color = 'white') +
scale_fill_gradient(low = "blue", high = "red", name = "Rel. Freq.") +
xlab("True") +
ylab("Predicted") +
labs(title = title) +
theme_minimal()
# Print the plot
print(plot)
}
# Convert probabilities to binary predictions using a threshold of 0.5
predict_all_binary <- ifelse(predict_all > 0.5, 1, 0)
predict_sub_binary <- ifelse(predict_sub > 0.5, 1, 0)
# Plot confusion matrix for logistic regression with all variables
generate_cm(y_test, predict_all_binary, "Confusion Matrix for Logistic Regression (Full Model)")
## # A tibble: 4 × 4
## # Groups: Predicted [2]
## True Predicted Freq Predicted_pct
## <fct> <fct> <int> <dbl>
## 1 0 0 2346 0.913
## 2 1 0 224 0.0872
## 3 0 1 108 0.104
## 4 1 1 932 0.896
# Plot confusion matrix for logistic regression with subset of variables
generate_cm(y_test, predict_sub_binary, "Confusion Matrix for Logistic Regression (Subset Model)")
## # A tibble: 4 × 4
## # Groups: Predicted [2]
## True Predicted Freq Predicted_pct
## <fct> <fct> <int> <dbl>
## 1 0 0 2176 0.765
## 2 1 0 669 0.235
## 3 0 1 278 0.363
## 4 1 1 487 0.637
insert formula
# run lasso on training data to collect coefficients
lasso <- cv.glmnet(x=X_train, y=y_train, alpha = 1, family="binomial", type.measure = "class")
# optimal lambda
lasso_lambda_min <- lasso$lambda.min
cat("Optimal Lambda for Lasso:", lasso_lambda_min, "\n")
## Optimal Lambda for Lasso: 9.970342e-05
# plot misclassification error against log lambda
plot(lasso)
title(main = "Cross-Validation Misclassification Error",
sub = "Optimal Lambda Highlighted in Red",
xlab = "Log(Lambda)",
ylab = "Misclassification Error")
# plot coefficients
plot(lasso$glmnet.fit, xvar = "lambda", label = TRUE)
title(main = "Coefficient Shrinkage Path for LASSO",
sub = "Optimal Lambda Highlighted in Red",
xlab = "Log(Lambda)",
ylab = "Coefficients")
# get fitted probabilities, using best lambda
lasso_predict <- predict(lasso, s = lasso_lambda_min, X_test, type = "response")
# add variable for predicted classes
lasso_y_hat <- ifelse(lasso_predict > 0.5, 1, 0)
# define function to generate and plot confusion matrix
generate_cm <- function(true, predicted, title) {
cm <- as.data.frame(table(True = true, Predicted = predicted)) %>%
group_by(Predicted) %>%
mutate(Predicted_pct = Freq / sum(Freq))
print(cm)
plot <- ggplot(data = cm, mapping = aes(x = ordered(True, c(1, 0)), y = Predicted, fill = Predicted_pct)) +
geom_tile() +
geom_text(aes(label = round(Predicted_pct, 2)), color = 'white') +
scale_fill_gradient(low = "blue", high = "red", name = "Rel. Freq.") +
xlab("True") +
ylab("Predicted") +
labs(title = title) +
theme_minimal()
# print the plot
print(plot)
}
# plot confusion matrix for lasso regression out of sample
generate_cm(y_test, lasso_y_hat, "Confusion Matrix for LASSO (Full Model)")
## # A tibble: 4 × 4
## # Groups: Predicted [2]
## True Predicted Freq Predicted_pct
## <fct> <fct> <int> <dbl>
## 1 0 0 2347 0.913
## 2 1 0 225 0.0875
## 3 0 1 107 0.103
## 4 1 1 931 0.897
# compute ROC
lasso_roc_full <- roc(response = y_test, predictor = lasso_predict)
## Setting levels: control = 0, case = 1
## Warning in roc.default(response = y_test, predictor = lasso_predict):
## Deprecated use a matrix as predictor. Unexpected results may be produced,
## please pass a numeric vector.
## Setting direction: controls < cases
# plot ROC
Note - Lucas Some issues with the graphics What is the benefit of doing the subset?
# run lasso on training data to collect coefficients
lasso_subset <- cv.glmnet(x=X_train_subset, y=y_train, alpha = 1, family="binomial", type.measure = "class")
lasso_lambda_min_subset <- lasso_subset$lambda.min
# plot misclassification error against log lambda
plot(lasso_subset)
title(main = "Cross-Validation Misclassification Error",
sub = "Optimal Lambda Highlighted in Red",
xlab = "Log(Lambda)",
ylab = "Misclassification Error")
# plot coefficients
plot(lasso_subset$glmnet.fit, xvar = "lambda", label = TRUE)
title(main = "Coefficient Shrinkage Path for LASSO",
sub = "Optimal Lambda Highlighted in Red",
xlab = "Log(Lambda)",
ylab = "Coefficients")
# get fitted probabilities
lasso_predict_subset <- predict(lasso_subset, s = lasso_lambda_min_subset, X_test_subset, type = "response")
# add variable for predicted classes
lasso_y_hat_subset <- ifelse(lasso_predict_subset > 0.5, 1, 0)
# plot confusion matrix for lasso subset regression out of sample
generate_cm(y_test, lasso_y_hat_subset, "Confusion Matrix for LASSO (Subset Model)")
## # A tibble: 4 × 4
## # Groups: Predicted [2]
## True Predicted Freq Predicted_pct
## <fct> <fct> <int> <dbl>
## 1 0 0 2240 0.756
## 2 1 0 724 0.244
## 3 0 1 214 0.331
## 4 1 1 432 0.669
# new function to compute and plot combined roc for full and subset models
plot_combined_roc <- function(full_roc, subset_roc, full_label, subset_label) {
# plot the full model roc curve
plot(y = full_roc$sensitivities, x = 1 - full_roc$specificities, type = 'l',
col = 'blue', lwd = 2, xlab = 'False Positive Rate', ylab = 'True Positive Rate',
main = 'ROC Curves')
# add the subset model roc curve
lines(y = subset_roc$sensitivities, x = 1 - subset_roc$specificities, col = 'green', lwd = 2)
# add random guess line
abline(a = 0, b = 1, lty = 2, col = 'gray')
# add grid
grid()
# add legend
legend('bottomright',
legend = c(sprintf("%s AUC: %.3f", full_label, auc(full_roc)),
sprintf("%s AUC: %.3f", subset_label, auc(subset_roc)),
'Random Guess'),
lty = c(1, 1, 2), lwd = 2, col = c('blue', 'green', 'gray'))
# add bounding box
box()
}
# compute roc object for subset model
lasso_roc_subset <- roc(response = y_test, predictor = lasso_predict_subset)
## Setting levels: control = 0, case = 1
## Warning in roc.default(response = y_test, predictor = lasso_predict_subset):
## Deprecated use a matrix as predictor. Unexpected results may be produced,
## please pass a numeric vector.
## Setting direction: controls < cases
# plot combined roc curve for lasso models
plot_combined_roc(lasso_roc_full, lasso_roc_subset, "Lasso Full", "Lasso Subset")
# run ridge regression on training data to collect coefficients
ridge <- cv.glmnet(x = X_train, y = y_train, alpha = 0, family = "binomial", type.measure = "class")
ridge_lambda_min <- ridge$lambda.min
cat("optimal lambda for ridge:", ridge_lambda_min, "\n")
## optimal lambda for ridge: 0.03932508
# plot misclassification error against log lambda
plot(ridge)
title(main = "cross-validation misclassification error (ridge)",
sub = "optimal lambda highlighted in red",
xlab = "log(lambda)",
ylab = "misclassification error")
# plot coefficients
plot(ridge$glmnet.fit, xvar = "lambda", label = TRUE)
title(main = "coefficient shrinkage path for ridge",
sub = "optimal lambda highlighted in red",
xlab = "log(lambda)",
ylab = "coefficients")
# get fitted probabilities
ridge_predict <- predict(ridge, s = ridge_lambda_min, X_test, type = "response")
nrow(ridge_predict)
## [1] 3610
# add variable for predicted classes
ridge_y_hat <- ifelse(ridge_predict > 0.5, 1, 0)
# plot confusion matrix for ridge regression out of sample
generate_cm(y_test, ridge_y_hat, "Confusion Matrix for Ridge (Full Model)")
## # A tibble: 4 × 4
## # Groups: Predicted [2]
## True Predicted Freq Predicted_pct
## <fct> <fct> <int> <dbl>
## 1 0 0 2348 0.907
## 2 1 0 241 0.0931
## 3 0 1 106 0.104
## 4 1 1 915 0.896
# compute roc object for full ridge model
ridge_roc_full <- roc(response = y_test, predictor = ridge_predict)
## Setting levels: control = 0, case = 1
## Warning in roc.default(response = y_test, predictor = ridge_predict):
## Deprecated use a matrix as predictor. Unexpected results may be produced,
## please pass a numeric vector.
## Setting direction: controls < cases
# run ridge regression on subset predictors
ridge_subset <- cv.glmnet(x = X_train_subset, y = y_train, alpha = 0, family = "binomial", type.measure = "class")
ridge_lambda_min_subset <- ridge_subset$lambda.min
cat("Optimal lambda for ridge (subset):", ridge_lambda_min_subset, "\n")
## Optimal lambda for ridge (subset): 0.1018653
# plot misclassification error against log lambda
plot(ridge_subset)
title(main = "Cross-Validation Misclassification Error (Ridge - Subset)",
sub = "Optimal Lambda Highlighted in Red",
xlab = "Log(Lambda)",
ylab = "Misclassification Error")
# plot coefficients
plot(ridge_subset$glmnet.fit, xvar = "lambda", label = TRUE)
title(main = "Coefficient Shrinkage Path for Ridge (Subset)",
sub = "Optimal Lambda Highlighted in Red",
xlab = "Log(Lambda)",
ylab = "Coefficients")
# get fitted probabilities
ridge_predict_subset <- predict(ridge_subset, s = ridge_lambda_min_subset, X_test_subset, type = "response")
# add variable for predicted classes
ridge_y_hat_subset <- ifelse(ridge_predict_subset > 0.5, 1, 0)
# plot confusion matrix for ridge subset regression out of sample
generate_cm(y_test, ridge_y_hat_subset, "Confusion Matrix for Ridge (Subset Model)")
## # A tibble: 4 × 4
## # Groups: Predicted [2]
## True Predicted Freq Predicted_pct
## <fct> <fct> <int> <dbl>
## 1 0 0 2290 0.748
## 2 1 0 771 0.252
## 3 0 1 164 0.299
## 4 1 1 385 0.701
# compute roc object for subset ridge model
ridge_roc_subset <- roc(response = y_test, predictor = ridge_predict_subset)
## Setting levels: control = 0, case = 1
## Warning in roc.default(response = y_test, predictor = ridge_predict_subset):
## Deprecated use a matrix as predictor. Unexpected results may be produced,
## please pass a numeric vector.
## Setting direction: controls < cases
# plot combined roc curve for ridge models
plot_combined_roc(ridge_roc_full, ridge_roc_subset, "Ridge Full", "Ridge Subset")
insert formula
note no optimization of alpha here, yet Note - Lucas This is where we could use the basis of the earlier en regression I did. I.e. initialise an en with alpha of 0 and then increase alpha and replace it in the model only if it improves accuracy - to discuss note here - add training accuracy measures? to compare
# run elastic net regression on training data to collect coefficients
en <- cv.glmnet(x = X_train, y = y_train, alpha = 0.5, family = "binomial", type.measure = "class")
en_lambda_min <- en$lambda.min
cat("optimal lambda for elastic net:", en_lambda_min, "\n")
## optimal lambda for elastic net: 0.0001655511
# plot misclassification error against log lambda
plot(en)
title(main = "cross-validation misclassification error (elastic net)",
sub = "optimal lambda highlighted in red",
xlab = "log(lambda)",
ylab = "misclassification error")
# plot coefficients
plot(en$glmnet.fit, xvar = "lambda", label = TRUE)
title(main = "coefficient shrinkage path for elastic net",
sub = "optimal lambda highlighted in red",
xlab = "log(lambda)",
ylab = "coefficients")
# get fitted probabilities
en_predict <- predict(en, s = en_lambda_min, X_test, type = "response")
nrow(en_predict)
## [1] 3610
# add variable for predicted classes
en_y_hat <- ifelse(en_predict > 0.5, 1, 0)
# plot confusion matrix for en regression out of sample
generate_cm(y_test, en_y_hat, "Confusion Matrix for Elastic Net (Full Model)")
## # A tibble: 4 × 4
## # Groups: Predicted [2]
## True Predicted Freq Predicted_pct
## <fct> <fct> <int> <dbl>
## 1 0 0 2347 0.913
## 2 1 0 225 0.0875
## 3 0 1 107 0.103
## 4 1 1 931 0.897
# compute roc object for subset ridge model
en_roc_full <- roc(response = y_test, predictor = en_predict)
## Setting levels: control = 0, case = 1
## Warning in roc.default(response = y_test, predictor = en_predict): Deprecated
## use a matrix as predictor. Unexpected results may be produced, please pass a
## numeric vector.
## Setting direction: controls < cases
Potential Alternative to optimising over alpha
#alpha_values <- seq(0, 1, by = 0.01) # Range of alpha to test
#en_opt_best_alpha <- NULL
#en_opt_best_auc <- -Inf # Initialize with the lowest possible AUC
#en_opt_best_model <- NULL
#en_opt_best_lambda <- NULL
# Iterate over alpha values
#for (alpha in alpha_values) {
# Train Elastic Net model for the given alpha
#en_opt <- cv.glmnet(X_train, y_train, alpha = alpha, family = "binomial", type.measure = "class")
# Get the best lambda from cross-validation
#en_opt_lambda_min <- en_opt$lambda.min
# Predict probabilities on the validation set
#en_opt_y_prob <- predict(en_opt, s = en_opt_lambda_min, newx = X_test, type = "response")
# Compute AUC
#en_opt_auc_value <- auc(response = y_test, predictor = en_opt_y_prob)
# Update best alpha if AUC improves
#if (en_opt_auc_value > best_auc) {
#en_opt_best_auc <- en_opt_auc_value
#en_opt_best_alpha <- alpha
#en_opt_best_model <- en_opt_auc_value
#en_opt_best_lambda <- en_opt_lambda_min
#}
# Print progress
#cat(sprintf("Alpha: %.2f, AUC: %.5f\n", alpha, en_opt_auc_value))
#}
# Output best alpha and its AUC
#cat(sprintf("Best Alpha: %.2f\n", en_opt_best_alpha))
#cat(sprintf("Highest AUC: %.5f\n", en_opt_best_auc))
# Best model is stored in `en_opt_best_model` with lambda = `en_opt_best_lambda`
# run elastic net regression on subset training data to collect coefficients
en_subset <- cv.glmnet(x = X_train_subset, y = y_train, alpha = 0.5, family = "binomial", type.measure = "class")
# get optimal lambda for elastic net subset
en_lambda_min_subset <- en_subset$lambda.min
cat("Optimal Lambda for Elastic Net (Subset):", en_lambda_min_subset, "\n")
## Optimal Lambda for Elastic Net (Subset): 0.00636804
# plot misclassification error against log lambda for subset
plot(en_subset)
title(main = "Cross-Validation Misclassification Error (Elastic Net - Subset)",
sub = "Optimal Lambda Highlighted in Red",
xlab = "Log(Lambda)",
ylab = "Misclassification Error")
# plot coefficient shrinkage for subset
plot(en_subset$glmnet.fit, xvar = "lambda", label = TRUE)
title(main = "Coefficient Shrinkage Path (Elastic Net - Subset)",
sub = "Optimal Lambda Highlighted in Red",
xlab = "Log(Lambda)",
ylab = "Coefficients")
# get fitted probabilities for subset
en_predict_subset <- predict(en_subset, s = en_lambda_min_subset, X_test_subset, type = "response")
# add variable for predicted classes for subset
en_y_hat_subset <- ifelse(en_predict_subset > 0.5, 1, 0)
# plot confusion matrix for en_subset regression out of sample
generate_cm(y_test, en_y_hat_subset, "Confusion Matrix for Elastic Net (Subset Model)")
## # A tibble: 4 × 4
## # Groups: Predicted [2]
## True Predicted Freq Predicted_pct
## <fct> <fct> <int> <dbl>
## 1 0 0 2227 0.759
## 2 1 0 709 0.241
## 3 0 1 227 0.337
## 4 1 1 447 0.663
# compute roc object for subset ridge model
en_roc_subset <- roc(response = y_test, predictor = en_predict_subset)
## Setting levels: control = 0, case = 1
## Warning in roc.default(response = y_test, predictor = en_predict_subset):
## Deprecated use a matrix as predictor. Unexpected results may be produced,
## please pass a numeric vector.
## Setting direction: controls < cases
# plot combined roc curve for ridge models
plot_combined_roc(en_roc_full, en_roc_subset, "Elastic Net Full", "Elastic Net Subset")
# insert most important coef from the models here
Next, we implement the random forest algorithm [justification].
We do this on the subset of the data as.
We tune the maximum number features of the tree first.
To do this, we split our current training data set into a training set and validation set.
# subset training data for validation split
n_train <- nrow(X_train)
id_split <- sample(1:n_train, floor(0.5 * n_train))
# split into training and validation sets
X_train_subset_final <- X_train_subset[id_split, ]
y_train_final <- y_train[id_split]
X_val_subset <- X_train_subset[-id_split, ]
y_val <- y_train[-id_split]
# print dimensions
cat("dimensions of X_train_final:", dim(X_train_subset_final), "\n")
## dimensions of X_train_final: 4213 125
cat("dimensions of y_train_final:", length(y_train_final), "\n")
## dimensions of y_train_final: 4213
cat("dimensions of X_val:", dim(X_val_subset), "\n")
## dimensions of X_val: 4213 125
cat("dimensions of y_val:", length(y_val), "\n")
## dimensions of y_val: 4213
# random forest tuning parameters
ntree <- 100 # can optimize over thislater
# we also use the default for bootstrap sample size, i.e nrow(X_train_final)
# initialize performance storage
train_acc <- test_acc <- oob_acc <- ncol(X_train_subset_final)
# tune random forest for max_features
for (max_features in 1:ncol(X_train_subset_final)) { # setting this < ncol(X_train_subset_final for now, just to check if it runs)
# fit the rf model using training data
rf <- randomForest(x = X_train_subset_final, y = factor(y_train_final), mtry = max_features, ntree = ntree)
# use trained model to predict y with validation x
y_pred_val <- predict(rf, X_val_subset)
# use trained model to predict y with training x
y_pred_train <- rf$predicted
# compute OOB accuracy measure - OOB error rate
oob_acc[max_features] <- 1 - rf$err.rate[ntree, "OOB"]
# compute training accuracy
train_acc[max_features] <- mean(y_train_final == as.numeric(levels(y_pred_train)[y_pred_train]))
# compute test (val) accuracy - key parameter of interest
test_acc[max_features] <- mean(y_val == as.numeric(levels(y_pred_val)[y_pred_val]))
}
# identify best max_features in validation data
m_star <- which.max(test_acc)
cat("optimal mtry (max_features):", m_star, "\n")
## optimal mtry (max_features): 15
# plot accuracy metrics
n_features <- length(train_acc) # Should be 15 based on initialization
plot(seq_len(n_features), train_acc, type = 'l', main = 'Tuning mtry for Subset',
ylab = 'Accuracy', xlab = 'Number of Features', col = 'blue')
lines(seq_len(n_features), test_acc, col = 'red')
lines(seq_len(n_features), oob_acc, lty = 2, col = 'green')
legend('bottomright', legend = c('Train Accuracy', 'Validation Accuracy', 'OOB Accuracy'),
col = c('blue', 'red', 'green'), lty = c(1, 1, 2), bty = 'n')
# train final random forest model with optimal m
RFTuned <- randomForest(x = X_train_subset_final, y = factor(y_train_final), mtry = m_star, ntree = ntree)
# predict using tuned model and test x
RFPred <- predict(RFTuned, newdata = X_test_subset)
# compute fitted probabilities with tuned model and test x
RFProb <- predict(RFTuned, newdata = X_test_subset, type = "prob")
# compute roc and auc for random forest
RF_roc <- roc(response = factor(y_test), predictor = RFProb[, 2], levels = c(0, 1))
## Setting direction: controls < cases
RF_auc <- round(auc(RF_roc), 4)
cat("Random Forest AUC (Subset):", RF_auc, "\n")
## Random Forest AUC (Subset): 0.7959
# plot roc
plot(y = RF_roc$sensitivities, x = 1 - RF_roc$specificities, type = 'l',
col = 'darkblue', lwd = 2, xlab = 'False Positive Rate', ylab = 'True Positive Rate',
main = 'ROC Curve for Random Forest (Subset)')
abline(a = 0, b = 1, lty = 2, col = 'gray')
grid()
legend('bottomright',
legend = c(sprintf("Random Forest AUC: %.3f", auc(RF_roc)), 'Random Guess'),
lty = c(1, 2), lwd = 2, col = c('darkblue', 'gray'))
box()
# generate confusion matrix
generate_cm(
true = y_test,
predicted = RFPred,
title = "Confusion Matrix for Random Forest (Subset)"
)
## # A tibble: 4 × 4
## # Groups: Predicted [2]
## True Predicted Freq Predicted_pct
## <fct> <fct> <int> <dbl>
## 1 0 0 2238 0.769
## 2 1 0 673 0.231
## 3 0 1 216 0.309
## 4 1 1 483 0.691
if we have computation capacity
# Extract Random Forest importance
rf_importance_subset <- data.frame(
Variable = rownames(RFTuned$importance),
Scaled_Importance = RFTuned$importance[, "MeanDecreaseGini"] / max(RFTuned$importance[, "MeanDecreaseGini"], na.rm = TRUE)
)
rf_importance_subset <- rf_importance_subset %>%
arrange(desc(Scaled_Importance))
# plot rf variable importance - top 20
ggplot(rf_importance_subset %>% arrange(desc(Scaled_Importance)) %>% slice(1:20),
aes(x = reorder(Variable, Scaled_Importance), y = Scaled_Importance)) +
geom_bar(stat = "identity", fill = "blue", alpha = 0.7) +
coord_flip() +
labs(title = "Top 20 Random Forest Variable Importance (Subset)",
x = "Variable",
y = "Scaled Importance") +
theme_minimal()
Insert interpretation on importance
Relative to lasso etc.
make a final graph here with the ROCs for each model on the same graph, or a table of AUCs for each model, to make it easier to compare rather than going back and looking at at separate graphs etc
In the raw data, the flag’s entries are actually
"Y" and "N", but we clean this.↩︎